Abstract — In this paper, the problem of training optimization
for estimating a multiple-input multiple-output (MIMO) flat 
fading channel in the presence of spatially and temporally 
correlated Gaussian noise is studied in an application-oriented 
setup. So far, the problem of MIMO channel estimation has 
mostly been treated within the context of minimizing the mean 
square error (MSE) of the channel estimate subject to various 
constraints, such as an upper bound on the available training 
energy. We introduce a more general framework for the task 
of training sequence design in MIMO systems, which can treat 
not only the minimization of the channel estimator's MSE, but
also the optimization of a final performance metric of interest 
related to the use of the channel estimate in the communication 
system. First, we show that the proposed framework can be 
used to minimize the training energy budget subject to a quality 
constraint on the MSE of the channel estimator. A deterministic 
version of the "dual" problem is also provided. We then focus 
on four specific applications, where the training sequence can 
be optimized with respect to the classical channel estimation 
MSE, a weighted channel estimation MSE and the MSE of the 
equalization error due to the use of an equalizer at the receiver 
or an appropriate linear precoder at the transmitter. In this way, 
the intended use of the channel estimate is explicitly accounted 
for. The superiority of the proposed designs over existing methods 
is demonstrated via numerical simulations. 

Index Terms — Channel equalization, L-optimality criterion, 
MIMO channels, system identification, training sequence design. 



I. Introduction 

An important factor in the performance of multiple-antenna
systems is the accuracy of the channel state information
(CSI) [1]. CSI is primarily used at the receiver side
for purposes of coherent or semi-coherent detection, but it
can also be used at the transmitter side, e.g., for precoding
and adaptive modulation. Since in communication systems 
the maximization of spectral efficiency is an objective of 
interest, the training duration and energy should be minimized. 
Most current systems use training signals that are white, both 
spatially and temporally, which is known to be a good choice 
according to several criteria [2], [3]. However, when
some prior knowledge of the channel or noise statistics is
available, it is possible to tailor the training signal and to obtain
significantly improved performance. In particular, several authors
have studied scenarios where long-term CSI in the form
of a covariance matrix over the short-term fading is available.
So far, most proposed algorithms have been designed to 
minimize the squared error of the channel estimate, e.g., [4]- 
[9]. Alternative design criteria are used in [5] and [10], where 
the channel entropy is minimized given the received training 
signal. In [11], the resulting capacity in the case of a single- 
input single-output (SISO) channel is considered, while [12] 
focuses on the pairwise error probability. 

Herein, a generic context is described, drawing from similar
techniques that have recently been proposed for training signal
design in system identification [13]-[15]. This context aims
at providing a unified theoretical framework that can be
used to treat the MIMO training optimization problem in
various scenarios. Furthermore, it provides a different way
of looking at the aforementioned problem that could be
adjusted to a wide variety of estimation-related problems in
communication systems. First, we show how the problem of
minimizing the training energy subject to a quality constraint
can be solved, while a "dual" deterministic (average design)
problem is also considered (see footnote 1). Next, we show that by a
suitable definition of the performance measure the problem 
of optimizing the training for minimizing the channel MSE 
can be treated as a special case. We also consider a weighted 
version of the channel MSE, which relates to the well- 
known L-optimality criterion [17]. Moreover, we explicitly 
consider how the channel estimate will be used and attempt 
to optimize the end performance of the data transmission, 
which is not necessarily equivalent to minimizing the mean 
square error (MSE) of the channel estimate. Specifically, we 
study two uses of the channel estimate: channel equalization 
at the receiver using a minimum mean square error (MMSE) 
equalizer and channel inversion (zero-forcing precoding) at 
the transmitter, and derive the corresponding optimal training 
signals for each case. In the case of MMSE equalization, 
separate approximations are provided for the high and low 
SNR regimes. Finally, the resulting performance is illustrated 
based on numerical simulations. Compared to related results
in the control literature, here we directly design a finite-length
training signal and consider not only deterministic channel
parameters, but also a Bayesian channel estimation framework.
A related pilot design strategy has been proposed in [18] for
the problem of jointly estimating the frequency offset and the
channel impulse response in single antenna transmissions.

1 The word "dual" in this paper differs from the Lagrangian duality studied
in the context of convex optimization theory. See [16] for more details on this
type of duality.

Implementing an adaptive choice of pilot signals in a 
practical system would require a feedback signalling overhead, 
since both the transmitter and the receiver have to agree 
on the choice of the pilots. Just as previous studies in the 
area, the current paper is primarily intended to provide a 
theoretical benchmark on the resulting performance of such 
a scheme. Directly considering the end performance in the
pilot design is a step toward making the results more relevant.
The data model used in [4]-[10] is based on a questionable 
assumption, namely that the channel is frequency flat, but 
that the noise is allowed to be frequency selective. Such an 
assumption might be relevant in systems that share spectrum 
with other radio interfaces using a narrower bandwidth and 
possibly in situations where channel coding introduces a 
temporal correlation in interfering signals. In order to focus 
on the main principles of our proposed strategy and to keep 
the mathematical derivations as simple as possible, the same 
model has been used in the current paper. 

As a final comment, the novelty of this paper lies in introducing
the application-oriented framework as the appropriate
context for training sequence design in communication
systems. To this end, Hermitian form-like approximations of
performance metrics are addressed here because they are usually
good approximations of many performance metrics of
interest, as well as for simplicity and comprehensiveness
of presentation. To illustrate the framework, we have
for simplicity chosen to study performance metrics related to
the MSE of the information carrying signal after equalization.
Directly designing for performance metrics like bit error
rate (BER) would be even more relevant but would involve
more technical complications. Also, the BER is, to a good
approximation, monotonically increasing in the MSE of the
input to the detector, and we illustrate numerically that our
design outperforms the previous state of the art also in terms of
BER.

This paper is organized as follows: Section II introduces the
basic MIMO received signal model and specific assumptions
on the structure of channel and noise covariance matrices.
Section III presents the optimal channel estimators, when the
channel is considered to be either a deterministic or a random
matrix. Section IV presents the application-oriented optimal
training designs in a guaranteed performance context, based
on confidence ellipsoids and Markov bound relaxations. Moreover,
Section V focuses on four specific applications, namely
MSE channel estimation, channel estimation based on
the L-optimality criterion, and channel estimation for
MMSE equalization and ZF precoding. Numerical simulations
are provided in Section VI, while Section VII concludes this
paper.

Notations: Boldface (lower case) is used for column vectors,
$\mathbf{x}$, and (upper case) for matrices, $\mathbf{X}$. Moreover, $\mathbf{X}^T$, $\mathbf{X}^H$,
$\mathbf{X}^*$ and $\mathbf{X}^\dagger$ denote the transpose, the conjugate transpose,
the conjugate and the Moore-Penrose pseudoinverse of $\mathbf{X}$,
respectively. The trace of $\mathbf{X}$ is denoted as $\mathrm{tr}(\mathbf{X})$ and $\mathbf{A} \succeq \mathbf{B}$
means that $\mathbf{A} - \mathbf{B}$ is positive semidefinite. $\mathrm{vec}(\mathbf{X})$ is the
vector produced by stacking the columns of $\mathbf{X}$, and $(\mathbf{X})_{i,j}$
is the $(i,j)$-th element of $\mathbf{X}$. $[\mathbf{X}]_+$ means that all negative
eigenvalues of $\mathbf{X}$ are replaced by zeros (i.e., $[\mathbf{X}]_+ \succeq \mathbf{0}$).
$\mathcal{CN}(\bar{\mathbf{x}}, \mathbf{Q})$ stands for circularly symmetric complex Gaussian
random vectors, where $\bar{\mathbf{x}}$ is the mean and $\mathbf{Q}$ the covariance
matrix. Finally, $a!$ denotes the factorial of the nonnegative
integer $a$ and $\mathrm{mod}(a, b)$ the modulo operation between the
integers $a$, $b$.

II. System Model 

We consider a MIMO communication system with $n_T$
antennas at the transmitter and $n_R$ antennas at the receiver.
The received signal at time $t$ is modelled as

$$\mathbf{y}(t) = \mathbf{H}\mathbf{x}(t) + \mathbf{n}(t)$$

where $\mathbf{x}(t) \in \mathbb{C}^{n_T}$ and $\mathbf{y}(t) \in \mathbb{C}^{n_R}$ are the baseband
representations of the transmitted and received signals, respectively.
The impact of background noise and interference from adjacent
communication links is represented by the additive term
$\mathbf{n}(t) \in \mathbb{C}^{n_R}$. We will further assume that $\mathbf{x}(t)$ and $\mathbf{n}(t)$ are
independent (weakly) stationary signals. The channel response
is modeled by $\mathbf{H} \in \mathbb{C}^{n_R \times n_T}$, which is assumed constant
during the transmission of one block of data and independent
between blocks; that is, we are assuming frequency flat block
fading. Two different models of the channel will be considered:

i) A deterministic model.

ii) A stochastic Rayleigh fading model (see footnote 2), i.e.,
$\mathrm{vec}(\mathbf{H}) \in \mathcal{CN}(\mathbf{0}, \mathbf{R})$, where, for mathematical tractability, we will
assume that the known covariance matrix $\mathbf{R}$ possesses
the Kronecker model used, e.g., in [7], [10]:

$$\mathbf{R} = \mathbf{R}_T^T \otimes \mathbf{R}_R \tag{1}$$

where $\mathbf{R}_T \in \mathbb{C}^{n_T \times n_T}$ and $\mathbf{R}_R \in \mathbb{C}^{n_R \times n_R}$ are the spatial
covariance matrices at the transmitter and receiver side,
respectively. This model has been experimentally verified
in [19], [20] and further motivated in [21], [22].
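We consider training signals of arbitrary length $B$, represented
by $\mathbf{P} \in \mathbb{C}^{n_T \times B}$, whose columns are the transmitted
signal vectors during training. Placing the received vectors in
$\mathbf{Y} = [\mathbf{y}(1) \ \ldots \ \mathbf{y}(B)] \in \mathbb{C}^{n_R \times B}$, we have

$$\mathbf{Y} = \mathbf{H}\mathbf{P} + \mathbf{N},$$

where $\mathbf{N} = [\mathbf{n}(1) \ \ldots \ \mathbf{n}(B)] \in \mathbb{C}^{n_R \times B}$ is the combined
noise and interference matrix.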

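Defining $\tilde{\mathbf{P}} = \mathbf{P}^T \otimes \mathbf{I}$, we can then write

$$\mathrm{vec}(\mathbf{Y}) = \tilde{\mathbf{P}}\,\mathrm{vec}(\mathbf{H}) + \mathrm{vec}(\mathbf{N}). \tag{2}$$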
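
The vectorized model (2) can be checked numerically. The following is a minimal NumPy sketch and not part of the original derivation; the dimensions, the random test data and the helper names `crandn` and `vec` are assumptions made for illustration only. It relies on the standard identity vec(AXB) = (B^T ⊗ A) vec(X), with vec(·) stacking columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n_T, n_R, B = 2, 3, 4  # illustrative sizes (assumed, not from the paper)

def crandn(*shape):
    """Hypothetical helper: complex array with Gaussian real and imaginary parts."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

def vec(X):
    """Stack the columns of X into one vector (column-major order)."""
    return X.reshape(-1, order="F")

H = crandn(n_R, n_T)   # channel matrix
P = crandn(n_T, B)     # training matrix, one column per training vector
N = crandn(n_R, B)     # noise and interference matrix

Y = H @ P + N                          # matrix model Y = HP + N
P_tilde = np.kron(P.T, np.eye(n_R))    # P~ = P^T (x) I, plain transpose (no conjugate)

# Both sides of (2)
lhs = vec(Y)
rhs = P_tilde @ vec(H) + vec(N)
identity_holds = np.allclose(lhs, rhs)
```

Note that the transpose in $\mathbf{P}^T \otimes \mathbf{I}$ is not a conjugate transpose, and that the identity matrix has size $n_R \times n_R$.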

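As, for example, in [7], [10], we assume that $\mathrm{vec}(\mathbf{N}) \in
\mathcal{CN}(\mathbf{0}, \mathbf{S})$, where the covariance matrix $\mathbf{S}$ also possesses a
Kronecker structure

$$\mathbf{S} = \mathbf{S}_Q^T \otimes \mathbf{S}_R. \tag{3}$$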

2 For simplicity, we have assumed a zero-mean channel, but it is straight- 
forward to extend the results to Rician fading channels, similarly to [9]. 
Here, $\mathbf{S}_Q \in \mathbb{C}^{B \times B}$ represents the temporal covariance matrix (see footnote 3)
and $\mathbf{S}_R \in \mathbb{C}^{n_R \times n_R}$ represents the received spatial covariance
matrix.

The channel and noise statistics will be assumed known
to the receiver during estimation. Statistics can often be
obtained by long-term estimation and tracking [23].

For the data transmission phase, we will assume that the
transmit signal $\{\mathbf{x}(t)\}$ is a zero-mean, weakly stationary
process, which is both temporally and spatially white, i.e.,
its spectrum is $\mathbf{\Phi}_x(\omega) = \lambda_x \mathbf{I}$.



III. Channel Matrix Estimation 

A. Deterministic Channel Estimation 

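The minimum variance unbiased (MVU) channel estimator
for the signal model (2), subject to a deterministic channel
(Assumption i) in Section II), is given by [24]

$$\mathrm{vec}(\hat{\mathbf{H}}_{MVU}) = (\tilde{\mathbf{P}}^H \mathbf{S}^{-1} \tilde{\mathbf{P}})^{-1} \tilde{\mathbf{P}}^H \mathbf{S}^{-1}\, \mathrm{vec}(\mathbf{Y}). \tag{4}$$

This estimate has the distribution

$$\mathrm{vec}(\hat{\mathbf{H}}_{MVU}) \in \mathcal{CN}\big(\mathrm{vec}(\mathbf{H}),\ \mathcal{I}_{F,MVU}^{-1}\big), \tag{5}$$

where $\mathcal{I}_{F,MVU}$ is the inverse covariance matrix

$$\mathcal{I}_{F,MVU} = \tilde{\mathbf{P}}^H \mathbf{S}^{-1} \tilde{\mathbf{P}}. \tag{6}$$

From this, it follows that the estimation error $\tilde{\mathbf{H}} = \hat{\mathbf{H}}_{MVU} - \mathbf{H}$
will, with probability $\alpha$, belong to the uncertainty set

$$\mathcal{D}_D = \left\{ \tilde{\mathbf{H}} : \mathrm{vec}^H(\tilde{\mathbf{H}})\, \mathcal{I}_{F,MVU}\, \mathrm{vec}(\tilde{\mathbf{H}}) \le \tfrac{1}{2} \chi^2_\alpha(2 n_T n_R) \right\} \tag{7}$$

where $\chi^2_\alpha(n)$ is the $\alpha$ percentile of the $\chi^2(n)$ distribution [15].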
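
To make (4) and (6) concrete, the following is a minimal NumPy sketch of the MVU estimator; it is an illustration under stated assumptions, not the authors' implementation. The helper names (`vec`, `unvec`, `random_pd`, `mvu_estimate`), the dimensions and the randomly generated covariance matrices are hypothetical. In the noise-free case the estimate must reproduce the channel exactly, which serves as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(1)

def vec(X):
    """Stack the columns of X (column-major order)."""
    return X.reshape(-1, order="F")

def unvec(v, rows, cols):
    """Inverse of vec for a rows-by-cols matrix."""
    return v.reshape(rows, cols, order="F")

def random_pd(n):
    """Hypothetical helper: random Hermitian positive definite matrix."""
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return A @ A.conj().T / n + np.eye(n)

def mvu_estimate(Y, P, S):
    """Return the MVU estimate of H, cf. (4), and the matrix in (6)."""
    n_R, n_T = Y.shape[0], P.shape[0]
    P_tilde = np.kron(P.T, np.eye(n_R))      # P~ = P^T (x) I
    S_inv_P = np.linalg.solve(S, P_tilde)    # S^{-1} P~
    fisher = P_tilde.conj().T @ S_inv_P      # P~^H S^{-1} P~
    rhs = S_inv_P.conj().T @ vec(Y)          # P~^H S^{-1} vec(Y), using S = S^H
    return unvec(np.linalg.solve(fisher, rhs), n_R, n_T), fisher

n_T, n_R, B = 2, 3, 4                        # assumed sizes, B >= n_T
H = rng.standard_normal((n_R, n_T)) + 1j * rng.standard_normal((n_R, n_T))
P = rng.standard_normal((n_T, B)) + 1j * rng.standard_normal((n_T, B))
S = np.kron(random_pd(B).T, random_pd(n_R))  # S = S_Q^T (x) S_R, cf. (3)

H_hat, fisher = mvu_estimate(H @ P, P, S)    # noise-free observation Y = HP
noise_free_recovery = np.allclose(H_hat, H)
```

The sketch solves linear systems instead of forming explicit inverses of $\mathbf{S}$ and $\mathcal{I}_{F,MVU}$, which is usually the numerically safer choice.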

B. Bayesian Channel Estimation 

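For the case of a stochastic channel model (Assumption ii)
in Section II), the posterior channel distribution becomes (see
[24])

$$\mathrm{vec}(\mathbf{H}) \,|\, \mathbf{Y}, \mathbf{P} \in \mathcal{CN}\big(\mathrm{vec}(\hat{\mathbf{H}}_{MMSE}),\ \mathbf{C}_{MMSE}\big), \tag{8}$$

where the first and second moments are

$$\begin{aligned}
\mathrm{vec}(\hat{\mathbf{H}}_{MMSE}) &= (\mathbf{R}^{-1} + \tilde{\mathbf{P}}^H \mathbf{S}^{-1} \tilde{\mathbf{P}})^{-1} \tilde{\mathbf{P}}^H \mathbf{S}^{-1}\, \mathrm{vec}(\mathbf{Y}), \\
\mathbf{C}_{MMSE} &= (\mathbf{R}^{-1} + \tilde{\mathbf{P}}^H \mathbf{S}^{-1} \tilde{\mathbf{P}})^{-1}.
\end{aligned} \tag{9}$$

Thus, the estimation error $\tilde{\mathbf{H}} = \hat{\mathbf{H}}_{MMSE} - \mathbf{H}$ will, with
probability $\alpha$, belong to the uncertainty set

$$\mathcal{D}_B = \left\{ \tilde{\mathbf{H}} : \mathrm{vec}^H(\tilde{\mathbf{H}})\, \mathcal{I}_{F,MMSE}\, \mathrm{vec}(\tilde{\mathbf{H}}) \le \tfrac{1}{2} \chi^2_\alpha(2 n_T n_R) \right\}, \tag{10}$$

where $\mathcal{I}_{F,MMSE} \triangleq \mathbf{C}_{MMSE}^{-1}$ is the inverse covariance matrix in
the MMSE case [15].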
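
The Bayesian estimator (9) admits a similar sketch. The code below is illustrative only: the helper names (`vec`, `unvec`, `crandn`, `random_pd`, `mmse_estimate`), the sizes and the random covariances are assumptions. Noise is drawn with covariance $\mathbf{S}$ through a Cholesky factor. Since $\mathbf{R}^{-1} + \tilde{\mathbf{P}}^H \mathbf{S}^{-1} \tilde{\mathbf{P}} \succeq \mathbf{R}^{-1}$, the posterior covariance can never exceed the prior covariance, and the last line checks this.

```python
import numpy as np

rng = np.random.default_rng(2)

def vec(X):
    return X.reshape(-1, order="F")

def unvec(v, rows, cols):
    return v.reshape(rows, cols, order="F")

def crandn(*shape):
    """Hypothetical helper: unit-variance circular complex Gaussian samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def random_pd(n):
    """Hypothetical helper: random Hermitian positive definite matrix."""
    A = crandn(n, n)
    return A @ A.conj().T + np.eye(n)

def mmse_estimate(Y, P, S, R):
    """Posterior mean of H and posterior covariance C_MMSE, cf. (9)."""
    n_R, n_T = Y.shape[0], P.shape[0]
    P_tilde = np.kron(P.T, np.eye(n_R))
    S_inv_P = np.linalg.solve(S, P_tilde)                  # S^{-1} P~
    info = np.linalg.inv(R) + P_tilde.conj().T @ S_inv_P   # R^{-1} + P~^H S^{-1} P~
    C = np.linalg.inv(info)
    h_hat = C @ (S_inv_P.conj().T @ vec(Y))
    return unvec(h_hat, n_R, n_T), C

n_T, n_R, B = 2, 3, 4                                # assumed sizes
R = np.kron(random_pd(n_T).T, random_pd(n_R))        # R = R_T^T (x) R_R, cf. (1)
S = np.kron(random_pd(B).T, random_pd(n_R))          # S = S_Q^T (x) S_R, cf. (3)
P = crandn(n_T, B)

H = unvec(np.linalg.cholesky(R) @ crandn(n_R * n_T), n_R, n_T)   # vec(H) ~ CN(0, R)
N = unvec(np.linalg.cholesky(S) @ crandn(n_R * B), n_R, B)       # vec(N) ~ CN(0, S)

H_hat, C = mmse_estimate(H @ P + N, P, S, R)
prior_minus_posterior = np.linalg.eigvalsh(R - C).min()   # nonnegative up to round-off
```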

3 We set the subscript Q to $\mathbf{S}_Q$ to highlight its temporal nature and the fact
that its size is $B \times B$. The matrices with subscript T in this paper share the
common characteristic that they are $n_T \times n_T$, while those with subscript R
are $n_R \times n_R$.



IV. Application-Oriented Optimal Training Design

In a communication system, an estimate of the channel, say
$\hat{\mathbf{H}}$, is needed at the receiver to detect the data symbols and may
also be used at the transmitter to improve the performance. Let
$J(\tilde{\mathbf{H}}, \mathbf{H})$ be a scalar measure of the performance degradation
at the receiver due to the estimation error $\tilde{\mathbf{H}}$ for a channel $\mathbf{H}$.
The objective of the training signal design is then to ensure
that the resulting channel estimation error $\tilde{\mathbf{H}}$ is such that

$$J(\tilde{\mathbf{H}}, \mathbf{H}) \le \frac{1}{\gamma} \tag{11}$$

for some parameter $\gamma > 0$, which we call accuracy. In our
settings, (11) typically cannot be ensured, since the channel
estimation error is Gaussian distributed (see (5) and (8)) and,
therefore, can be arbitrarily large. However, for the MVU
estimator (4), we know that, with probability $\alpha$, $\tilde{\mathbf{H}}$ will belong
to the set $\mathcal{D}_D$ defined in (7). Thus, we are led to training signal
designs which guarantee (11) for all channel estimation errors
$\tilde{\mathbf{H}} \in \mathcal{D}_D$. One training design problem that is based on this
concept is to minimize the required transmit energy budget
subject to this constraint

$$\text{DGPP}: \quad \begin{aligned} \underset{\mathbf{P} \in \mathbb{C}^{n_T \times B}}{\text{minimize}} \quad & \mathrm{tr}(\mathbf{P}\mathbf{P}^H) \\ \text{s.t.} \quad & J(\tilde{\mathbf{H}}, \mathbf{H}) \le \frac{1}{\gamma} \quad \forall \tilde{\mathbf{H}} \in \mathcal{D}_D. \end{aligned} \tag{12}$$

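Similarly, for the MMSE estimator in Subsection III-B, the
corresponding optimization problem is given as follows

$$\text{SGPP}: \quad \begin{aligned} \underset{\mathbf{P} \in \mathbb{C}^{n_T \times B}}{\text{minimize}} \quad & \mathrm{tr}(\mathbf{P}\mathbf{P}^H) \\ \text{s.t.} \quad & J(\tilde{\mathbf{H}}, \mathbf{H}) \le \frac{1}{\gamma} \quad \forall \tilde{\mathbf{H}} \in \mathcal{D}_B, \end{aligned} \tag{13}$$

where $\mathcal{D}_B$ is defined in (10). We will call (12) and (13)
the deterministic guaranteed performance problem (DGPP)
and the stochastic guaranteed performance problem (SGPP),
respectively. An alternative, "dual", problem is to maximize
the accuracy $\gamma$ subject to a constraint $\mathcal{P} > 0$ on the transmit
energy budget. For the MVU estimator this can be written as

$$\text{DMPP}: \quad \begin{aligned} \underset{\mathbf{P} \in \mathbb{C}^{n_T \times B}}{\text{maximize}} \quad & \gamma \\ \text{s.t.} \quad & J(\tilde{\mathbf{H}}, \mathbf{H}) \le \frac{1}{\gamma} \quad \forall \tilde{\mathbf{H}} \in \mathcal{D}_D, \\ & \mathrm{tr}(\mathbf{P}\mathbf{P}^H) \le \mathcal{P}. \end{aligned} \tag{14}$$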

We will call this problem the deterministic maximized perfor- 
mance problem (DMPP). The corresponding Bayesian prob- 
lem will be denoted as the stochastic maximized performance 
problem (SMPP). We will study the DGPP/SGPP in detail 
in this contribution, but the DMPP/SMPP can be treated in 
similar ways. In fact, Theorem 3 in [16] suggests that the 
solutions to the DMPP/SMPP are the same as for DGPP/SGPP, 
save for a scaling factor. 

The existing work on optimal training design for MIMO
channels is, to the best of the authors' knowledge, based
upon standard measures of the quality of the channel estimate,
rather than on the quality of the end use of the channel. The
framework presented in this section can be used to treat the
existing results as special cases. Additionally, if an end performance
metric is optimized, the DGPP/SGPP and DMPP/SMPP
formulations better reflect the ultimate objective of the training
design. This type of optimal training design formulation has
already been used in the control literature, but mainly for
large sample sizes [13], [14], [25], [26], yielding an enhanced
performance with respect to conventional estimation-theoretic
approaches. A reasonable question is whether such a
performance gain can be achieved in the case of training
sequence design for MIMO channel estimation, where the
sample sizes would be very small.

Remark: Ensuring (11) can be translated into a chance
constraint of the form

$$\Pr\left\{ J(\tilde{\mathbf{H}}, \mathbf{H}) \le \frac{1}{\gamma} \right\} \ge 1 - \varepsilon \tag{15}$$

for some $\varepsilon \in [0, 1]$. Problems (12), (13) and (14) correspond
to a convex relaxation of this chance constraint based on
confidence ellipsoids [27], as we show in the next subsection.

A. Approximating the Training Design Problems 

A key issue regarding the above training signal design problems
is their computational tractability. In general, they are
highly non-linear and non-convex. However, for performance
metrics that are sufficiently smooth functions of the estimation
error and have a minimum when the estimation error is zero,
Taylor's theorem shows that they can be well approximated by
a constant plus a quadratic term in $\tilde{\mathbf{H}}$. Therefore, we consider
performance metrics that can be approximated by

$$J(\tilde{\mathbf{H}}, \mathbf{H}) \approx \mathrm{vec}^H(\tilde{\mathbf{H}})\, \mathcal{I}_{adm}\, \mathrm{vec}(\tilde{\mathbf{H}}). \tag{16}$$

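For mathematical tractability, we will further assume that the
Hermitian positive definite matrix $\mathcal{I}_{adm}$ can be written in
Kronecker product form as $\mathcal{I}_T^T \otimes \mathcal{I}_R$ for some matrices $\mathcal{I}_T$ and
$\mathcal{I}_R$. In Section V, we will show several examples of practically
relevant performance metrics that can be approximated in this
form. This means that we can approximate the set $\{\tilde{\mathbf{H}} :
J(\tilde{\mathbf{H}}, \mathbf{H}) \le 1/\gamma\}$ of all admissible estimation errors $\tilde{\mathbf{H}}$ by
a (complex) ellipsoid in the parameter space

$$\mathcal{D}_{adm} = \left\{ \tilde{\mathbf{H}} : \mathrm{vec}^H(\tilde{\mathbf{H}})\, \gamma \mathcal{I}_{adm}\, \mathrm{vec}(\tilde{\mathbf{H}}) \le 1 \right\}. \tag{17}$$

Consequently, the DGPP (12) can be approximated by

$$\text{ADGPP}: \quad \begin{aligned} \underset{\mathbf{P} \in \mathbb{C}^{n_T \times B}}{\text{minimize}} \quad & \mathrm{tr}(\mathbf{P}\mathbf{P}^H) \\ \text{s.t.} \quad & \mathcal{D}_D \subseteq \mathcal{D}_{adm}. \end{aligned} \tag{18}$$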

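We call this problem the approximative DGPP (ADGPP). Both
$\mathcal{D}_D$ and $\mathcal{D}_{adm}$ are level sets of quadratic functions of the
channel estimation error. Rewriting (7) so that we have the
same level as in (17), we obtain

$$\mathcal{D}_D = \left\{ \tilde{\mathbf{H}} : \mathrm{vec}^H(\tilde{\mathbf{H}})\, \frac{2\,\mathcal{I}_{F,MVU}}{\chi^2_\alpha(2 n_T n_R)}\, \mathrm{vec}(\tilde{\mathbf{H}}) \le 1 \right\}.$$

Comparing this expression with (17) gives that $\mathcal{D}_D \subseteq \mathcal{D}_{adm}$
if and only if

$$\frac{2\,\mathcal{I}_{F,MVU}}{\chi^2_\alpha(2 n_T n_R)} \succeq \gamma\, \mathcal{I}_{adm}$$

(for a more general result see [15, Theorem 3.1]).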
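
The inclusion test above reduces to a positive semidefiniteness check, which can be sketched as follows. This is an illustrative NumPy fragment, not taken from the paper: the function name `ellipsoid_contained`, its tolerance and the toy matrices are assumptions, and the percentile $\chi^2_\alpha(2 n_T n_R)$ is passed in as a number (`chi2_level`) that the caller is assumed to obtain from a statistics table or library.

```python
import numpy as np

def ellipsoid_contained(fisher, I_adm, gamma, chi2_level, tol=1e-10):
    """Check D_D inside D_adm via  2*fisher/chi2_level - gamma*I_adm >= 0 (PSD).

    fisher     : Hermitian matrix playing the role of I_F,MVU in (6)
    I_adm      : Hermitian matrix from the quadratic approximation (16)
    gamma      : required accuracy
    chi2_level : chi-square percentile with 2*n_T*n_R degrees of freedom (supplied by caller)
    """
    gap = 2.0 * fisher / chi2_level - gamma * I_adm
    gap = (gap + gap.conj().T) / 2          # symmetrize against round-off
    return bool(np.linalg.eigvalsh(gap).min() >= -tol)

# Toy example (assumed values): fisher = 4 I and chi2_level = 2 give 2*fisher/chi2_level = 4 I,
# so the inclusion holds exactly when gamma <= 4 for I_adm = I.
fisher = 4.0 * np.eye(3)
I_adm = np.eye(3)
ok_for_gamma_3 = ellipsoid_contained(fisher, I_adm, gamma=3.0, chi2_level=2.0)
ok_for_gamma_5 = ellipsoid_contained(fisher, I_adm, gamma=5.0, chi2_level=2.0)
```

In the toy example the first call returns `True` and the second `False`, since the required accuracy 5 exceeds what the training energy encoded in `fisher` can guarantee.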



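When $\mathcal{I}_{adm}$ has the form $\mathcal{I}_{adm} = \mathcal{I}_T^T \otimes \mathcal{I}_R$, with $\mathcal{I}_T \in
\mathbb{C}^{n_T \times n_T}$ and $\mathcal{I}_R \in \mathbb{C}^{n_R \times n_R}$, the ADGPP (18) can then be
written as

$$\begin{aligned} \underset{\mathbf{P} \in \mathbb{C}^{n_T \times B}}{\text{minimize}} \quad & \mathrm{tr}(\mathbf{P}\mathbf{P}^H) \\ \text{s.t.} \quad & \tilde{\mathbf{P}}^H \mathbf{S}^{-1} \tilde{\mathbf{P}} \succeq \frac{\gamma\, \chi^2_\alpha(2 n_T n_R)}{2}\, \mathcal{I}_T^T \otimes \mathcal{I}_R. \end{aligned} \tag{19}$$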



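Similarly, by observing that $\mathcal{D}_{adm}$ only depends on the channel
estimation error, and following the derivations above, the
SGPP can be approximated by the following formulation

$$\begin{aligned} \underset{\mathbf{P} \in \mathbb{C}^{n_T \times B}}{\text{minimize}} \quad & \mathrm{tr}(\mathbf{P}\mathbf{P}^H) \\ \text{s.t.} \quad & \mathbf{R}^{-1} + \tilde{\mathbf{P}}^H \mathbf{S}^{-1} \tilde{\mathbf{P}} \succeq \frac{\gamma\, \chi^2_\alpha(2 n_T n_R)}{2}\, \mathcal{I}_T^T \otimes \mathcal{I}_R. \end{aligned} \tag{20}$$

We call the last problem approximative SGPP (ASGPP).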

Remarks: 

1) Several examples of the approximation (16) are presented
in Section V. The approximation (16) is not
possible for the performance metric of every application.
Therefore, in some applications, alternative convex
approximations of the corresponding performance metrics
may have to be found.

2) The quality of the approximation (16) is characterized
by its tightness to the true performance
metric. For our purposes, when the tightness of the
aforementioned approximation is acceptable, such an
approximation will be desirable because it corresponds to
a Hermitian form, therefore offering nice mathematical
properties and tractability.

3) The sizes of $\mathcal{D}_D$ and $\mathcal{D}_{adm}$ critically depend on the
parameter $\alpha$. In practice, requiring $\alpha$ to have a value
close to 1 corresponds to adequately representing the
uncertainty set in which (approximately) all possible
channel estimation errors lie.

B. The Deterministic Guaranteed Performance Problem 

The problem formulations for ADGPP and ASGPP in (19)
and (20), respectively, are similar in structure. The solutions
to these problems (and to other approximative guaranteed
performance problems) can be obtained from the following
general theorem.

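Theorem 1: Consider the optimization problem

$$\begin{aligned} \underset{\mathbf{P} \in \mathbb{C}^{n \times N}}{\text{minimize}} \quad & \mathrm{tr}(\mathbf{P}\mathbf{P}^H) \\ \text{s.t.} \quad & \mathbf{P}\mathbf{A}^{-1}\mathbf{P}^H \succeq \mathbf{B} \end{aligned} \tag{21}$$

where $\mathbf{A} \in \mathbb{C}^{N \times N}$ is Hermitian positive definite, $\mathbf{B} \in \mathbb{C}^{n \times n}$
is Hermitian positive semi-definite, and $N \ge \mathrm{rank}(\mathbf{B})$. An
optimal solution to (21) is

$$\mathbf{P}_{opt} = \mathbf{U}_B \mathbf{D}_P \mathbf{U}_A^H \tag{22}$$

where $\mathbf{D}_P \in \mathbb{C}^{n \times N}$ is a rectangular diagonal matrix with
$\sqrt{(\mathbf{D}_A)_{1,1}(\mathbf{D}_B)_{1,1}}, \ldots, \sqrt{(\mathbf{D}_A)_{m,m}(\mathbf{D}_B)_{m,m}}$ on the main
diagonal. Here, $m = \min(n, N)$, while $\mathbf{U}_A$ and $\mathbf{U}_B$ are
unitary matrices that originate from the eigendecompositions
of $\mathbf{A}$ and $\mathbf{B}$, respectively, i.e.,

$$\begin{aligned} \mathbf{A} &= \mathbf{U}_A \mathbf{D}_A \mathbf{U}_A^H \\ \mathbf{B} &= \mathbf{U}_B \mathbf{D}_B \mathbf{U}_B^H \end{aligned} \tag{23}$$

and $\mathbf{D}_A$, $\mathbf{D}_B$ are real-valued diagonal matrices, with their
diagonal elements sorted in ascending and descending order,
respectively; that is, $0 < (\mathbf{D}_A)_{1,1} \le \cdots \le (\mathbf{D}_A)_{N,N}$ and
$(\mathbf{D}_B)_{1,1} \ge \cdots \ge (\mathbf{D}_B)_{n,n} \ge 0$.

If the eigenvalues of $\mathbf{A}$ and $\mathbf{B}$ are distinct and strictly positive,
then the solution (22) is unique up to the multiplication
of the columns of $\mathbf{U}_A$ and $\mathbf{U}_B$ by complex unit-norm scalars.
Proof: The proof is given in Appendix II. ■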
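
The construction $\mathbf{P}_{opt} = \mathbf{U}_B \mathbf{D}_P \mathbf{U}_A^H$ of Theorem 1 can be sketched numerically as follows. This is a minimal NumPy illustration under assumed random inputs, not the authors' code; the names `theorem1_training`, `crandn` and `B_mat` are hypothetical (`B_mat` stands for the matrix $\mathbf{B}$ to avoid confusion with the training length). The sketch checks that the constraint is met with equality and that the energy equals $\sum_i (\mathbf{D}_A)_{i,i}(\mathbf{D}_B)_{i,i}$.

```python
import numpy as np

rng = np.random.default_rng(3)

def crandn(*shape):
    """Hypothetical helper: complex Gaussian test data."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

def theorem1_training(A, B_mat):
    """Build P = U_B D_P U_A^H for: minimize tr(P P^H) s.t. P A^{-1} P^H >= B_mat."""
    n, N = B_mat.shape[0], A.shape[0]
    d_A, U_A = np.linalg.eigh(A)             # eigh returns ascending eigenvalues
    d_B, U_B = np.linalg.eigh(B_mat)
    d_B, U_B = d_B[::-1], U_B[:, ::-1]       # reorder to descending for B_mat
    d_B = np.clip(d_B, 0.0, None)            # remove tiny negative round-off
    m = min(n, N)
    D_P = np.zeros((n, N))
    idx = np.arange(m)
    D_P[idx, idx] = np.sqrt(d_A[:m] * d_B[:m])
    return U_B @ D_P @ U_A.conj().T

N, n = 4, 3                                  # assumed sizes
G = crandn(N, N)
A = G @ G.conj().T + np.eye(N)               # Hermitian positive definite
F = crandn(n, 2)
B_mat = F @ F.conj().T                       # Hermitian PSD with rank 2 <= N

P_opt = theorem1_training(A, B_mat)
achieved = P_opt @ np.linalg.solve(A, P_opt.conj().T)   # P A^{-1} P^H
energy = np.trace(P_opt @ P_opt.conj().T).real
```

The pairing of the smallest eigenvalues of $\mathbf{A}$ with the largest eigenvalues of $\mathbf{B}$ is what keeps the energy low: the most demanding directions of the constraint are served through the least noisy directions.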

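By the right choice of $\mathbf{A}$ and $\mathbf{B}$, Theorem 1 will solve the
ADGPP in (19). This is shown by the next theorem (recall
that we have assumed that $\mathbf{S} = \mathbf{S}_Q^T \otimes \mathbf{S}_R$).

Theorem 2: Consider the optimization problem

$$\begin{aligned} \underset{\mathbf{P} \in \mathbb{C}^{n_T \times B}}{\text{minimize}} \quad & \mathrm{tr}(\mathbf{P}\mathbf{P}^H) \\ \text{s.t.} \quad & \tilde{\mathbf{P}}^H (\mathbf{S}_Q^T \otimes \mathbf{S}_R)^{-1} \tilde{\mathbf{P}} \succeq c\, \mathcal{I}_T^T \otimes \mathcal{I}_R \end{aligned} \tag{24}$$

where $\tilde{\mathbf{P}} = \mathbf{P}^T \otimes \mathbf{I}$, $\mathbf{S}_Q \in \mathbb{C}^{B \times B}$, $\mathbf{S}_R \in \mathbb{C}^{n_R \times n_R}$ are Hermitian
positive definite, $\mathcal{I}_T \in \mathbb{C}^{n_T \times n_T}$, $\mathcal{I}_R \in \mathbb{C}^{n_R \times n_R}$ are
Hermitian positive semi-definite, and $c$ is a positive constant.

If $B \ge \mathrm{rank}(\mathcal{I}_T)$, this problem is equivalent to (21) in
Theorem 1 for $\mathbf{A} = \mathbf{S}_Q$ and $\mathbf{B} = c\, \lambda_{\max}(\mathbf{S}_R \mathcal{I}_R)\, \mathcal{I}_T$, where
$\lambda_{\max}(\cdot)$ denotes the maximum eigenvalue.

Proof: The proof is given in Appendix III. ■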
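
The reduction in Theorem 2 can be sanity-checked numerically: build the training matrix from Theorem 1 with $\mathbf{A} = \mathbf{S}_Q$ and $\mathbf{B} = c\,\lambda_{\max}(\mathbf{S}_R \mathcal{I}_R)\,\mathcal{I}_T$, then verify that the Kronecker-structured constraint (24) holds. The sketch below is illustrative only; all names (`random_pd`, `theorem1_training`, `B_len`, the chosen sizes and the constant `c`) are assumptions, and the check demonstrates feasibility, not optimality.

```python
import numpy as np

rng = np.random.default_rng(4)

def random_pd(n):
    """Hypothetical helper: random Hermitian positive definite matrix."""
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return A @ A.conj().T / n + np.eye(n)

def theorem1_training(A, B_mat):
    """P = U_B D_P U_A^H, so that P A^{-1} P^H equals B_mat (Theorem 1)."""
    n, N = B_mat.shape[0], A.shape[0]
    d_A, U_A = np.linalg.eigh(A)                       # ascending
    d_B, U_B = np.linalg.eigh(B_mat)
    d_B, U_B = np.clip(d_B[::-1], 0.0, None), U_B[:, ::-1]   # descending
    m = min(n, N)
    D_P = np.zeros((n, N))
    D_P[np.arange(m), np.arange(m)] = np.sqrt(d_A[:m] * d_B[:m])
    return U_B @ D_P @ U_A.conj().T

n_T, n_R, B_len, c = 2, 2, 3, 1.5                      # assumed sizes and constant
S_Q, S_R = random_pd(B_len), random_pd(n_R)            # noise covariances
I_T, I_R = random_pd(n_T), random_pd(n_R)              # Kronecker factors of I_adm

# Eigenvalues of S_R I_R are real and nonnegative, so taking the real part is safe.
lam_max = np.linalg.eigvals(S_R @ I_R).real.max()
P = theorem1_training(S_Q, c * lam_max * I_T)

# Verify constraint (24): P~^H (S_Q^T (x) S_R)^{-1} P~ - c I_T^T (x) I_R is PSD.
P_tilde = np.kron(P.T, np.eye(n_R))
S = np.kron(S_Q.T, S_R)
gap = P_tilde.conj().T @ np.linalg.solve(S, P_tilde) - c * np.kron(I_T.T, I_R)
gap = (gap + gap.conj().T) / 2                         # symmetrize against round-off
min_gap_eig = np.linalg.eigvalsh(gap).min()            # about zero: constraint is tight
```

The smallest eigenvalue of `gap` is numerically zero rather than strictly positive, which reflects that the constraint is active in at least one direction.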

C. The Stochastic Guaranteed Performance Problem 

Next, we will see that Theorem 1 can also be used to solve
the ASGPP in (20). In order to obtain closed-form solutions,
we need some equality relation between the Kronecker blocks
of $\mathbf{R} = \mathbf{R}_T^T \otimes \mathbf{R}_R$ and of either $\mathbf{S} = \mathbf{S}_Q^T \otimes \mathbf{S}_R$ or $\mathcal{I}_{adm} =
\mathcal{I}_T^T \otimes \mathcal{I}_R$. For instance, it can be $\mathbf{R}_R = \mathbf{S}_R$, which may be
satisfied if the receive antennas are spatially uncorrelated or if
the signal and interference are received from the same main
direction. See [7] for details on the interpretations of these
assumptions.

The solution to ASGPP in (20) is given by the next theorem.

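Theorem 3: Consider the optimization problem

$$\begin{aligned} \underset{\mathbf{P} \in \mathbb{C}^{n_T \times B}}{\text{minimize}} \quad & \mathrm{tr}(\mathbf{P}\mathbf{P}^H) \\ \text{s.t.} \quad & \mathbf{R}^{-1} + \tilde{\mathbf{P}}^H \mathbf{S}^{-1} \tilde{\mathbf{P}} \succeq c\, \mathcal{I}_T^T \otimes \mathcal{I}_R \end{aligned} \tag{25}$$

where $\tilde{\mathbf{P}} = \mathbf{P}^T \otimes \mathbf{I}$, $\mathbf{R} = \mathbf{R}_T^T \otimes \mathbf{R}_R$, and $\mathbf{S} = \mathbf{S}_Q^T \otimes \mathbf{S}_R$.
Here, $\mathbf{R}_T \in \mathbb{C}^{n_T \times n_T}$, $\mathbf{R}_R \in \mathbb{C}^{n_R \times n_R}$, $\mathbf{S}_Q \in \mathbb{C}^{B \times B}$, $\mathbf{S}_R \in
\mathbb{C}^{n_R \times n_R}$ are Hermitian positive definite, $\mathcal{I}_T \in \mathbb{C}^{n_T \times n_T}$,
$\mathcal{I}_R \in \mathbb{C}^{n_R \times n_R}$ are Hermitian positive semi-definite, and $c$ is
a positive constant.

• If $\mathbf{R}_R = \mathbf{S}_R$ and $B \ge \mathrm{rank}([c\,\lambda_{\max}(\mathbf{S}_R \mathcal{I}_R)\,\mathcal{I}_T - \mathbf{R}_T^{-1}]_+)$,
then the problem is equivalent to (21) in Theorem 1 for
$\mathbf{A} = \mathbf{S}_Q$ and $\mathbf{B} = [c\,\lambda_{\max}(\mathbf{S}_R \mathcal{I}_R)\,\mathcal{I}_T - \mathbf{R}_T^{-1}]_+$.

• If $\mathbf{R}_R^{-1} = \mathcal{I}_R$ and $B \ge \mathrm{rank}([c\,\mathcal{I}_T - \mathbf{R}_T^{-1}]_+)$, then the
problem is equivalent to (21) in Theorem 1 for $\mathbf{A} = \mathbf{S}_Q$
and $\mathbf{B} = \lambda_{\max}(\mathbf{S}_R \mathcal{I}_R)\,[c\,\mathcal{I}_T - \mathbf{R}_T^{-1}]_+$.

• If $\mathbf{R}_T^{-1} = \mathcal{I}_T$ and $B \ge \mathrm{rank}(\mathcal{I}_T)$, then the problem
is equivalent to (21) in Theorem 1 for $\mathbf{A} = \mathbf{S}_Q$ and
$\mathbf{B} = \lambda_{\max}(\mathbf{S}_R [c\,\mathcal{I}_R - \mathbf{R}_R^{-1}]_+)\,\mathcal{I}_T$.

Proof: The proof is given in Appendix III. ■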
The mathematical difference between ADGPP and ASGPP
is the $\mathbf{R}^{-1}$ term that appears in the constraint of the latter.
This term has a clear impact on the structure of the optimal
ASGPP training matrix.

It is also worth noting that the solution for $\mathbf{R}_R = \mathbf{S}_R$
requires $B \ge \mathrm{rank}([c\,\lambda_{\max}(\mathbf{S}_R \mathcal{I}_R)\,\mathcal{I}_T - \mathbf{R}_T^{-1}]_+)$, which means
that solutions can be achieved also for $B < n_T$ (i.e., when
only the $B < n_T$ strongest eigendirections of the channel
are excited by training). In certain cases, e.g., when the
interference is temporally white ($\mathbf{S}_Q = \mathbf{I}$), it is optimal to
have $B = \mathrm{rank}([c\,\lambda_{\max}(\mathbf{S}_R \mathcal{I}_R)\,\mathcal{I}_T - \mathbf{R}_T^{-1}]_+)$, as a larger $B$ will
not decrease the training energy usage, cf. [9].

D. Optimizing the Average Performance 

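Apart from the previously presented training designs, the
application-oriented design can alternatively be formulated in the
following deterministic "dual" context. If $\mathbf{H}$ is considered to
be deterministic, then we can set up the following optimization
problem

$$\begin{aligned} \underset{\mathbf{P} \in \mathbb{C}^{n_T \times B}}{\text{minimize}} \quad & E_{\tilde{\mathbf{H}}}\{ J(\tilde{\mathbf{H}}, \mathbf{H}) \} \\ \text{s.t.} \quad & \mathrm{tr}(\mathbf{P}\mathbf{P}^H) \le \mathcal{P}. \end{aligned} \tag{26}$$

Clearly, for the MVU estimator

$$E_{\tilde{\mathbf{H}}}\{ J(\tilde{\mathbf{H}}, \mathbf{H}) \} = \mathrm{tr}\left\{ \mathcal{I}_{adm} (\tilde{\mathbf{P}}^H \mathbf{S}^{-1} \tilde{\mathbf{P}})^{-1} \right\},$$

so problem (26) is solved by the following theorem.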
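Theorem 4: Consider the optimization problem

$$\begin{aligned} \underset{\mathbf{P} \in \mathbb{C}^{n_T \times B}}{\text{minimize}} \quad & \mathrm{tr}\left\{ \mathcal{I}_{adm} (\tilde{\mathbf{P}}^H \mathbf{S}^{-1} \tilde{\mathbf{P}})^{-1} \right\} \\ \text{s.t.} \quad & \mathrm{tr}(\mathbf{P}\mathbf{P}^H) \le \mathcal{P} \end{aligned} \tag{27}$$

where $\mathcal{I}_{adm} = \mathcal{I}_T^T \otimes \mathcal{I}_R$ as before. Set $\mathcal{I}'_T = \mathcal{I}_T^T =
\mathbf{U}_T \mathbf{D}_T \mathbf{U}_T^H$ and $\mathbf{S}'_Q = \mathbf{S}_Q^T = \mathbf{U}_Q \mathbf{D}_Q \mathbf{U}_Q^H$. Here, $\mathbf{U}_T \in
\mathbb{C}^{n_T \times n_T}$, $\mathbf{U}_Q \in \mathbb{C}^{B \times B}$ are unitary matrices and $\mathbf{D}_T$, $\mathbf{D}_Q$
are diagonal $n_T \times n_T$ and $B \times B$ matrices containing
the eigenvalues of $\mathcal{I}'_T$ and $\mathbf{S}'_Q$ in descending and ascending
order, respectively. Then, the optimal training matrix
$\mathbf{P}$ equals $(\mathbf{U}_T \mathbf{D}_P \mathbf{U}_Q^H)^*$, where $\mathbf{D}_P$ is an $n_T \times B$ diagonal
matrix with main diagonal entries equal to

$$(\mathbf{D}_P)_{i,i} = \sqrt{ \frac{\mathcal{P} \sqrt{\alpha_i}}{\sum_{j=1}^{n_T} \sqrt{\alpha_j}} }, \quad i = 1, 2, \ldots, n_T \quad (B \ge n_T),$$

and $\alpha_i = (\mathbf{D}_T)_{i,i} (\mathbf{D}_Q)_{i,i}$, $i = 1, 2, \ldots, n_T$, with the aforementioned
ordering.

Proof: The proof is given in Appendix IV. ■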
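
The power allocation of Theorem 4 depends only on the eigenvalues, so it can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the authors' code: the function names and the numerical values are hypothetical, all eigenvalues are assumed strictly positive, and the helper `average_cost` evaluates $\sum_i \alpha_i / (\mathbf{D}_P)_{i,i}^2$, which is the form the objective in (27) takes, up to a constant factor, once the training matrix has the diagonalizing structure stated in the theorem (this reduction is our own reading of the theorem).

```python
import math

def average_design_amplitudes(eig_I_T, eig_S_Q, budget):
    """Diagonal entries (D_P)_ii of Theorem 4 and the products alpha_i.

    eig_I_T : eigenvalues of I_T (strictly positive, any order), length n_T
    eig_S_Q : eigenvalues of S_Q (strictly positive, any order), length B >= n_T
    budget  : training energy budget (the constant on the right of the trace constraint)
    """
    n_T = len(eig_I_T)
    if len(eig_S_Q) < n_T:
        raise ValueError("Theorem 4 is stated for B >= n_T")
    d_T = sorted(eig_I_T, reverse=True)      # descending order
    d_Q = sorted(eig_S_Q)[:n_T]              # ascending order, first n_T entries
    alpha = [t * q for t, q in zip(d_T, d_Q)]
    root_sum = sum(math.sqrt(a) for a in alpha)
    amplitudes = [math.sqrt(budget * math.sqrt(a) / root_sum) for a in alpha]
    return amplitudes, alpha

def average_cost(alpha, amplitudes):
    """sum_i alpha_i / (D_P)_ii^2 for a given set of diagonal amplitudes."""
    return sum(a / d ** 2 for a, d in zip(alpha, amplitudes))

# Toy numbers (assumed): n_T = 2, B = 3, energy budget 6.
amps, alpha = average_design_amplitudes([1.0, 9.0], [16.0, 1.0, 4.0], budget=6.0)
energy_used = sum(d ** 2 for d in amps)
cost_opt = average_cost(alpha, amps)

# Baseline: split the same energy equally over the two excited directions.
uniform = [math.sqrt(6.0 / 2)] * 2
cost_uniform = average_cost(alpha, uniform)
```

For these toy numbers $\alpha = (9, 4)$, the allocated powers are $3.6$ and $2.4$, and the cost is $25/6$, compared with $13/3$ for the equal split.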
Remarks: 

1) In the general case of a non Kronecker-structured $\mathcal{I}_{adm}$,
the solution of the different designs, (19), (20) and
(27), can be obtained using numerical methods like the
semidefinite relaxation approach described in [28].

2) If $\mathcal{I}_{adm}$ depends on $\mathbf{H}$, then in order to implement
this design, the embedded $\mathbf{H}$ in $\mathcal{I}_{adm}$ may be replaced
by a previous channel estimate. This implies that this
approach is possible whenever the channel variations
allow for such a design. This observation also applies to
the designs in the previous subsections. See also [16],
[29], where the same issue is discussed for other system
identification applications.
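
The corresponding performance criterion for the case of the
MMSE estimator is given by

$$E_{\tilde{\mathbf{H}}, \mathbf{H}}\{ J(\tilde{\mathbf{H}}, \mathbf{H}) \} = \mathrm{tr}\left\{ \mathcal{I}_{adm} (\mathbf{R}^{-1} + \tilde{\mathbf{P}}^H \mathbf{S}^{-1} \tilde{\mathbf{P}})^{-1} \right\}.$$

In this case, we can derive closed form expressions for the
optimal training under assumptions similar to those made in
Theorem 3. We therefore have the following result: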
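Theorem 5: Consider the optimization problem

$$\begin{aligned} \underset{\mathbf{P} \in \mathbb{C}^{n_T \times B}}{\text{minimize}} \quad & \mathrm{tr}\left\{ \mathcal{I}_{adm} (\mathbf{R}^{-1} + \tilde{\mathbf{P}}^H \mathbf{S}^{-1} \tilde{\mathbf{P}})^{-1} \right\} \\ \text{s.t.} \quad & \mathrm{tr}(\mathbf{P}\mathbf{P}^H) \le \mathcal{P} \end{aligned} \tag{28}$$

where $\mathcal{I}_{adm} = \mathcal{I}_T^T \otimes \mathcal{I}_R$ as before. Set $\mathbf{S}'_Q = \mathbf{S}_Q^T =
\mathbf{V}_Q \mathbf{\Lambda}_Q \mathbf{V}_Q^H$. Here, we assume that $\mathbf{V}_Q \in \mathbb{C}^{B \times B}$ is a unitary
matrix and $\mathbf{\Lambda}_Q$ a diagonal $B \times B$ matrix containing the
eigenvalues of $\mathbf{S}'_Q$ in arbitrary order. Assume also that $\mathbf{R}'_T =
\mathbf{R}_T^T$ with eigenvalue decomposition $\mathbf{U}'_T \mathbf{\Lambda}'_T \mathbf{U}'^H_T$. The diagonal
elements of $\mathbf{\Lambda}'_T$ are assumed to be arbitrarily ordered. Then,
we have the following cases

• $\mathbf{R}_R = \mathbf{S}_R$: We further discriminate two cases

  - $\mathcal{I}_T = \mathbf{I}$: Then the optimal training is given by a
    straightforward adaptation of Proposition 2 in [8].

  - $\mathbf{R}_T^{-1} = \mathcal{I}_T$: Then, the optimal training matrix
    $\mathbf{P}$ equals $(\mathbf{U}'_T(\pi_{opt})\, \mathbf{D}_P\, \mathbf{V}_Q^H(\varpi_{opt}))^*$, where
    $\pi_{opt}$, $\varpi_{opt}$ stand for the optimal orderings of the
    eigenvalues of $\mathbf{R}'_T$ and $\mathbf{S}'_Q$, respectively. These optimal
    orderings are determined by Algorithm 1 in
    Appendix V. Additionally, define the parameter $m_*$
    as in eq. (69) (see Appendix V). Assuming in the
    following that, for simplicity of notation, the $(\mathbf{\Lambda}'_T)_{i,i}$'s
    and $(\mathbf{\Lambda}_Q)_{i,i}$'s have the optimal ordering, the optimal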
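    $(\mathbf{D}_P)_{j,j}$, $j = 1, 2, \ldots, m_*$ are given by the expression

$$(\mathbf{D}_P)_{j,j} = \sqrt{ \sqrt{\frac{(\mathbf{\Lambda}_Q)_{j,j}}{(\mathbf{\Lambda}'_T)_{j,j}}} \; \frac{\mathcal{P} + \sum_{i=1}^{m_*} \frac{(\mathbf{\Lambda}_Q)_{i,i}}{(\mathbf{\Lambda}'_T)_{i,i}}}{\sum_{i=1}^{m_*} \sqrt{\frac{(\mathbf{\Lambda}_Q)_{i,i}}{(\mathbf{\Lambda}'_T)_{i,i}}}} - \frac{(\mathbf{\Lambda}_Q)_{j,j}}{(\mathbf{\Lambda}'_T)_{j,j}} }$$

    while $(\mathbf{D}_P)_{j,j} = 0$ for $j = m_* + 1, \ldots, n_T$.
Proof: The proof is given in Appendix V. ■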
Remarks: Two interesting additional cases complementing
the last theorem are the following:

1) If the modal matrices of $\mathbf{R}_R$ and $\mathbf{S}_R$ are the same, $\mathcal{I}_T =
\mathbf{I}$ and $\mathcal{I}_R = \mathbf{I}$, then the optimal training is given by [9].

2) In any other case (e.g., if $\mathbf{R}_R \ne \mathbf{S}_R$), the training can be
found using numerical methods like the semidefinite
relaxation approach described in [28]. Note again that this
approach can also handle general $\mathcal{I}_{adm}$, not necessarily
expressed as $\mathcal{I}_T^T \otimes \mathcal{I}_R$.

As a general conclusion, the objective function of the dual
deterministic problems presented in this subsection can be
shown to correspond to Markov bound approximations of the
chance constraint (15). According to the analysis in [27], these
approximations should be tighter than the approximations
based on confidence ellipsoids presented in Subsections IV-A,
IV-B and IV-C for practically relevant values of $\varepsilon$.

V. Applications 

A. Optimal Training for Channel Estimation 

We now consider the channel estimation problem in its
standard context, where the performance metric of interest
is the (mean) square error of the corresponding channel
estimator. Linear estimators for this task are given by (4), (9).
The performance metric of interest is

$$J(\tilde{\mathbf{H}}, \mathbf{H}) = \mathrm{vec}^H(\tilde{\mathbf{H}})\, \mathrm{vec}(\tilde{\mathbf{H}}),$$

which corresponds to $\mathcal{I}_{adm} = \mathbf{I}$, i.e., to $\mathcal{I}_T = \mathbf{I}$ and $\mathcal{I}_R = \mathbf{I}$.
The ADGPP and ASGPP are given by (19) and (20), respectively,
with the corresponding substitutions. Their solutions
follow directly from Theorems 2 and 3, respectively. To the
best of the authors' knowledge, such formulations for the
classical MIMO training design problem are presented here for
the first time. Furthermore, solutions to the standard approach
of minimizing the channel MSE subject to a constraint on the
training energy budget are provided by Theorems 4 and 5 as
special cases.

Remark: Although the confidence ellipsoid and Markov
bound approximations are generally different [27], in the
simulation section we show that their performance is almost
identical for reasonable operating $\gamma$-regimes in the specific
case of standard channel estimation.

B. Optimal Training for the L-Optimality Criterion 
Consider now a performance metric of the form

$$J_W(\tilde{\mathbf{H}}, \mathbf{H}) = \mathrm{vec}^H(\tilde{\mathbf{H}})\, \mathbf{W}\, \mathrm{vec}(\tilde{\mathbf{H}}),$$

for some positive semidefinite weighting matrix $\mathbf{W}$. Assume
also that $\mathbf{W} = \mathbf{W}_1 \otimes \mathbf{W}_2$ for some positive semidefinite
matrices $\mathbf{W}_1$, $\mathbf{W}_2$. Taking the expected value of this performance
metric with respect to either $\tilde{\mathbf{H}}$ or both $\tilde{\mathbf{H}}$ and $\mathbf{H}$
leads to the well-known L-optimality criterion for optimal
experiment design in statistics [17]. In this case, $\mathcal{I}_T = \mathbf{W}_1^T$
and $\mathcal{I}_R = \mathbf{W}_2$. In the context of MIMO communication
systems, such a performance metric may arise, e.g., if we
want to estimate the MIMO channel having some deficiencies
in the transmit and/or the receive antenna arrays. The
simplest case would be both $\mathbf{W}_1$ and $\mathbf{W}_2$ being diagonal
with nonzero entries in the interval [0, 1], $\mathbf{W}_1$ representing
the deficiencies in the transmit antenna array and $\mathbf{W}_2$ in the
receive array. More general matrices can be considered if we
assume cross-couplings between the transmit and/or receive
antenna elements.

Remark: The numerical approach of [28] mentioned after
Theorems 4 and 5 can handle general weighting matrices $\mathbf{W}$,
not necessarily Kronecker-structured.

C. Optimal Training for Channel Equalization 

In this subsection we consider the problem of estimating
a transmitted signal sequence $\{\mathbf{x}(t)\}$ from the corresponding
received signal sequence $\{\mathbf{y}(t)\}$. Among a wide range of
methods that are available [30], [31], we will consider the
MMSE equalizer and for mathematical tractability we will
approximate it by the non-causal Wiener filter. Note that for
reasonably long block lengths, the MMSE estimate becomes
similar to the non-causal Wiener filter [32]. Thus, the optimal
training design based on the non-causal Wiener filter should
also provide good performance when using an MMSE
equalizer.

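1) Equalization using exact channel state information:
Let us first assume that $\mathbf{H}$ is available. In this ideal case,
and with the transmitted signal being weakly stationary with
spectrum $\mathbf{\Phi}_x$, the MSE-optimal estimate of the transmitted
signal $\mathbf{x}(t)$ from the received observations of $\mathbf{y}(t)$ can be
obtained according to

$$\hat{\mathbf{x}}(t; \mathbf{H}) = \mathbf{F}(q; \mathbf{H})\, \mathbf{y}(t) \tag{29}$$

where $q$ is the unit time shift operator, $[q\,\mathbf{x}(t) = \mathbf{x}(t+1)]$, and
the non-causal Wiener filter $\mathbf{F}(e^{j\omega}; \mathbf{H})$ is given by

$$\mathbf{F}(e^{j\omega}; \mathbf{H}) = \mathbf{\Phi}_{xy}(\omega)\, \mathbf{\Phi}_y^{-1}(\omega) = \mathbf{\Phi}_x(\omega)\, \mathbf{H}^H \left( \mathbf{H} \mathbf{\Phi}_x(\omega) \mathbf{H}^H + \mathbf{\Phi}_n(\omega) \right)^{-1}. \tag{30}$$

Here, $\mathbf{\Phi}_{xy}(\omega) = \mathbf{\Phi}_x(\omega) \mathbf{H}^H$ denotes the cross-spectrum
between $\mathbf{x}(t)$ and $\mathbf{y}(t)$, and

$$\mathbf{\Phi}_y(\omega) = \mathbf{H} \mathbf{\Phi}_x(\omega) \mathbf{H}^H + \mathbf{\Phi}_n(\omega) \tag{31}$$

is the spectral density of $\mathbf{y}(t)$. Using our assumption that
$\mathbf{\Phi}_x(\omega) = \lambda_x \mathbf{I}$, we obtain the simplified expression

$$\mathbf{F}(e^{j\omega}; \mathbf{H}) = \mathbf{H}^H \left( \mathbf{H}\mathbf{H}^H + \mathbf{\Phi}_n(\omega)/\lambda_x \right)^{-1}. \tag{32}$$

Remark: Assuming nonsingularity of $\mathbf{\Phi}_n(\omega)$ for every $\omega$,
the MMSE equalizer is applicable for all values of the pair
$(n_T, n_R)$.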
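The step from (30) to (32) can be checked numerically at a single frequency. The sketch below is illustrative only: the function names and the randomly drawn channel and noise spectrum are assumptions, not part of the original design.

```python
import numpy as np

def wiener_filter_general(H, Phi_x, Phi_n):
    """F = Phi_x H^H (H Phi_x H^H + Phi_n)^{-1} at a single frequency, as in (30)."""
    Phi_y = H @ Phi_x @ H.conj().T + Phi_n
    return Phi_x @ H.conj().T @ np.linalg.inv(Phi_y)

def wiener_filter_white(H, Phi_n, lam_x):
    """Simplified form H^H (H H^H + Phi_n / lam_x)^{-1} for Phi_x = lam_x I, as in (32)."""
    return H.conj().T @ np.linalg.inv(H @ H.conj().T + Phi_n / lam_x)

# Hypothetical example: n_R = 2, n_T = 4, Hermitian positive definite noise spectrum.
rng = np.random.default_rng(0)
n_R, n_T, lam_x = 2, 4, 2.0
H = rng.standard_normal((n_R, n_T)) + 1j * rng.standard_normal((n_R, n_T))
N = rng.standard_normal((n_R, n_R)) + 1j * rng.standard_normal((n_R, n_R))
Phi_n = N @ N.conj().T + np.eye(n_R)

F_general = wiener_filter_general(H, lam_x * np.eye(n_T), Phi_n)
F_white = wiener_filter_white(H, Phi_n, lam_x)
print(np.allclose(F_general, F_white))
```

Both functions return an $n_T \times n_R$ matrix, so the filter is well defined for any pair $(n_T, n_R)$ as long as the noise spectrum is nonsingular.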

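2) Equalization using a channel estimate: Consider now
the situation where the exact channel $\mathbf{H}$ is unavailable, and we
only have an estimate $\hat{\mathbf{H}}$. When we replace $\mathbf{H}$ by its estimate
in the expressions above, the estimation error for the equalizer
will increase. While the increase in the bit error rate would
be a natural measure of the quality of the channel estimate
$\hat{\mathbf{H}}$, for simplicity we consider the total MSE of the difference
$\hat{\mathbf{x}}(t; \mathbf{H} + \tilde{\mathbf{H}}) - \hat{\mathbf{x}}(t; \mathbf{H}) = \mathbf{\Delta}(q; \tilde{\mathbf{H}}, \mathbf{H})\, \mathbf{y}(t)$ (note that $\hat{\mathbf{H}} = \mathbf{H} + \tilde{\mathbf{H}}$),
using the notation $\mathbf{\Delta}(q; \tilde{\mathbf{H}}, \mathbf{H}) = \mathbf{F}(q; \mathbf{H} + \tilde{\mathbf{H}}) - \mathbf{F}(q; \mathbf{H})$.
In view of this, we will use the channel equalization (CE)
performance metric

$$\begin{aligned}
J_{CE}(\tilde{\mathbf{H}}, \mathbf{H}) &= E\left\{ [\mathbf{\Delta}(q; \tilde{\mathbf{H}}, \mathbf{H}) \mathbf{y}(t)]^H [\mathbf{\Delta}(q; \tilde{\mathbf{H}}, \mathbf{H}) \mathbf{y}(t)] \right\} \\
&= E\left\{ \operatorname{tr}\left( [\mathbf{\Delta}(q; \tilde{\mathbf{H}}, \mathbf{H}) \mathbf{y}(t)] [\mathbf{\Delta}(q; \tilde{\mathbf{H}}, \mathbf{H}) \mathbf{y}(t)]^H \right) \right\} \\
&= \frac{1}{2\pi} \int_{-\pi}^{\pi} \operatorname{tr}\left( \mathbf{\Delta}(e^{j\omega}; \tilde{\mathbf{H}}, \mathbf{H})\, \mathbf{\Phi}_y(\omega)\, \mathbf{\Delta}^H(e^{j\omega}; \tilde{\mathbf{H}}, \mathbf{H}) \right) d\omega.
\end{aligned} \tag{33}$$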

We see that the poorer the accuracy of the estimate, the larger
the performance metric $J_{CE}(\tilde{\mathbf{H}}, \mathbf{H})$ and, thus, the larger the
performance loss of the equalizer. Therefore, this performance
metric is a reasonable candidate to use when formulating
our training sequence design problem. Indeed, the Wiener
equalizer based on the estimate $\hat{\mathbf{H}} = \mathbf{H} + \tilde{\mathbf{H}}$ of $\mathbf{H}$ can
be deemed to have a satisfactory performance if $J_{CE}(\tilde{\mathbf{H}}, \mathbf{H})$
remains below some user-chosen threshold. Thus, we will use
$J_{CE}$ as $J$ in problems (12) and (13). Though these problems
are not convex, we show in Appendix I how they can be
convexified, provided some approximations are made.
Remarks: 

1) The excess MSE $J_{CE}(\tilde{\mathbf{H}}, \mathbf{H})$ quantifies the distance of
the MMSE equalizer using the channel estimate $\hat{\mathbf{H}}$ from
the clairvoyant MMSE equalizer, i.e., the one using the
true channel. This performance metric is not the same
as the classical MSE in the equalization context, where
the difference $\hat{\mathbf{x}}(t; \mathbf{H} + \tilde{\mathbf{H}}) - \mathbf{x}(t)$ is considered instead
of $\hat{\mathbf{x}}(t; \mathbf{H} + \tilde{\mathbf{H}}) - \hat{\mathbf{x}}(t; \mathbf{H})$. However, since in practice
the best transmit vector estimate that can be attained
is the clairvoyant one, the choice of $J_{CE}(\tilde{\mathbf{H}}, \mathbf{H})$ is
justified. This selection allows for a performance metric
approximation given by (10).

2) There are certain cases of interest where $J_{CE}(\tilde{\mathbf{H}}, \mathbf{H})$
approximately coincides with the classical equalization
MSE. Such a case occurs when $n_R \geq n_T$, $\mathbf{H}$ has
full column rank and the SNR is high during data
transmission.

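D. Optimal Training for Zero-Forcing (ZF) Precoding

Apart from receiver side channel equalization, as another
example of how to apply the channel estimate we consider
point-to-point zero-forcing precoding, also known as channel
inversion [33]. Here the channel estimate is fed back to
the transmitter and its (pseudo-)inverse is used as a linear
precoder. The data transmission is described by

$$\mathbf{y}(t) = \mathbf{H} \mathbf{\Psi} \mathbf{x}(t) + \mathbf{v}(t)$$

where the precoder is $\mathbf{\Psi} = \hat{\mathbf{H}}^\dagger$, i.e., $\mathbf{\Psi} = \hat{\mathbf{H}}^H (\hat{\mathbf{H}} \hat{\mathbf{H}}^H)^{-1}$ if
we limit ourselves to the practically relevant case $n_T \geq n_R$
and assume that $\hat{\mathbf{H}}$ is full rank. Note that $\mathbf{x}(t)$ is an $n_R \times 1$
vector in this case, but the transmitted vector is $\mathbf{\Psi} \mathbf{x}(t)$, which
is $n_T \times 1$.

Under these assumptions, and following the same strategy
and notation as in Appendix I, we get

$$\begin{aligned}
\mathbf{y}(t; \hat{\mathbf{H}}) - \mathbf{y}(t; \mathbf{H}) &= \mathbf{H} \hat{\mathbf{H}}^\dagger \mathbf{x}(t) + \mathbf{v}(t) - \left( \mathbf{H} \mathbf{H}^\dagger \mathbf{x}(t) + \mathbf{v}(t) \right) \\
&= \left( \hat{\mathbf{H}} \hat{\mathbf{H}}^\dagger - \tilde{\mathbf{H}} \hat{\mathbf{H}}^\dagger - \mathbf{I} \right) \mathbf{x}(t) \approx -\tilde{\mathbf{H}} \mathbf{H}^\dagger \mathbf{x}(t).
\end{aligned} \tag{34}$$

Consequently, a quadratic approximation of the cost function
is given by

$$\begin{aligned}
J_{ZF}(\tilde{\mathbf{H}}, \mathbf{H}) &= E\left\{ [\mathbf{y}(t; \hat{\mathbf{H}}) - \mathbf{y}(t; \mathbf{H})]^H [\mathbf{y}(t; \hat{\mathbf{H}}) - \mathbf{y}(t; \mathbf{H})] \right\} \\
&\approx \lambda_x \operatorname{vec}^H(\tilde{\mathbf{H}}) \left( \left( \mathbf{H}^\dagger (\mathbf{H}^\dagger)^H \right)^T \otimes \mathbf{I} \right) \operatorname{vec}(\tilde{\mathbf{H}}) \\
&= \operatorname{vec}^H(\tilde{\mathbf{H}}) \left( \mathcal{I}_T^T \otimes \mathcal{I}_R \right) \operatorname{vec}(\tilde{\mathbf{H}}),
\end{aligned} \tag{35}$$

if we define $\mathcal{I}_T = \lambda_x \mathbf{H}^\dagger (\mathbf{H}^\dagger)^H = \lambda_x \mathbf{H}^H (\mathbf{H}\mathbf{H}^H)^{-2} \mathbf{H}$ and
$\mathcal{I}_R = \mathbf{I}$.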
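The quadratic form in (35) can be compared against a direct evaluation of $\lambda_x \| \tilde{\mathbf{H}} \mathbf{H}^\dagger \|_F^2$, which is the expected squared norm of the approximate error in (34) when the data covariance is $\lambda_x \mathbf{I}$. The sketch below is illustrative; the helper names and the random test channel are assumptions made here.

```python
import numpy as np

def vec(M):
    """Column-major vectorization."""
    return np.asarray(M).reshape(-1, order="F")

def zf_precoder(H):
    """Right pseudo-inverse H^H (H H^H)^{-1}, valid for full row rank H (n_T >= n_R)."""
    return H.conj().T @ np.linalg.inv(H @ H.conj().T)

def zf_weight(H, lam_x):
    """I_T = lam_x * H^dagger (H^dagger)^H, the transmit-side weight in (35)."""
    Psi = zf_precoder(H)
    return lam_x * Psi @ Psi.conj().T

def zf_cost_quadratic(H_err, H, lam_x):
    """vec^H(H_err) (I_T^T kron I) vec(H_err)."""
    n_R = H.shape[0]
    h = vec(H_err)
    M = np.kron(zf_weight(H, lam_x).T, np.eye(n_R))
    return float(np.real(h.conj() @ M @ h))

def zf_cost_direct(H_err, H, lam_x):
    """lam_x * ||H_err H^dagger||_F^2, i.e. E||H_err H^dagger x||^2 for E[x x^H] = lam_x I."""
    return float(lam_x * np.linalg.norm(H_err @ zf_precoder(H), "fro") ** 2)

# Hypothetical example with n_R = 2 and n_T = 4.
rng = np.random.default_rng(0)
H = rng.standard_normal((2, 4)) + 1j * rng.standard_normal((2, 4))
H_err = 0.1 * (rng.standard_normal((2, 4)) + 1j * rng.standard_normal((2, 4)))
print(zf_cost_quadratic(H_err, H, 1.5), zf_cost_direct(H_err, H, 1.5))
```

Note that `zf_weight` returns an $n_T \times n_T$ matrix of rank at most $n_R$, which is relevant when $n_T > n_R$.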

Remark: The cost functions of (27) and (28) reveal the fact
that any performance-oriented training design is a compromise
between the strict channel estimation accuracy and the desired
accuracy related to the end performance metric at hand.
Fig. 1. $n_T = 4$, $n_R = 2$, $B = 6$, $\alpha(\%) = 99$: Channel estimation NMSE
based on Subsection IV-A with $\mathbf{R}_R = \mathbf{S}_R$. (Curves: White training, ASGPP, Optimal MMSE in [9].)

Fig. 2. $n_T = 6$, $n_R = 6$, $B = 8$, $\alpha(\%) = 99$: L-optimality criterion with
arbitrary but positive-semidefinite $\mathbf{W}_1, \mathbf{W}_2$ for the MVU estimator. (Curves: MVU, ADGPP, MVU in Subsection IV-D; horizontal axis: $\gamma$ (dB).)


Caution is needed to identify cases where the performance-oriented
design may severely degrade the channel estimation
accuracy, annihilating all gains from such a design. In the
case of ZF precoding, if $n_T > n_R$, $\mathcal{I}_T$ will have rank at
most $n_R$, yielding a training matrix $\mathbf{P}$ with only $n_R$ active
eigendirections. This is in contrast to the secondary target,
which is the channel estimation accuracy. Therefore, we expect
ADGPP, ASGPP and the approaches in Subsection IV-D
to behave abnormally in this case. Thus, we propose the
performance-oriented design only when $n_T = n_R$ in the
context of the ZF precoding.

VI. Numerical Examples 

The purpose of this section is to examine the performance
of optimal training sequence designs, and compare them with
existing methods. For the channel estimation MSE figure, we
plot the normalized MSE (NMSE), i.e., $E(\|\mathbf{H} - \hat{\mathbf{H}}\|^2 / \|\mathbf{H}\|^2)$,
versus the accuracy parameter $\gamma$. In all figures, fair comparison
among the presented schemes is ensured via training energy
equalization. Additionally, the matrices $\mathbf{R}_T$, $\mathbf{R}_R$, $\mathbf{S}_Q$, $\mathbf{S}_R$ follow
the exponential model, that is, they are built according
to

$$(\mathbf{R})_{i,j} = r^{\,j-i}, \quad j \geq i, \tag{36}$$

where $r$ is the (complex) normalized correlation coefficient
with magnitude $\rho = |r| < 1$. We choose to examine the high



Fig. 3. $n_T = 3$, $n_R = 3$, $B = 4$, $\alpha(\%) = 99$: L-optimality criterion with
arbitrary but positive-semidefinite $\mathbf{W}_1, \mathbf{W}_2$ for the MMSE estimator. (Curves: Optimal MMSE in [9], ASGPP, MMSE in Subsection IV-D; horizontal axis: $\gamma$ (dB).)

Fig. 4. $n_T = 4$, $n_R = 2$, $B = 6$, SNR $= 15$ dB, $\mu = 0.01$: MMSE
Channel Equalization with $\mathbf{R}_R \neq \mathbf{S}_R$. (Horizontal axis: $\gamma$ (dB).)

Fig. 5. $n_T = 5$, $n_R = 5$, $B = 7$, SNR $= 15$ dB, $\alpha(\%) = 99$, $\mu = 0.01$:
ZF precoding based on Subsection IV-D for the MVU estimator. $\mathcal{I}_{\mathrm{adm}}$ is based
on a previous channel estimate. (Curves: MVU, ADGPP, MVU in Subsection IV-D; horizontal axis: $\gamma$ (dB).)

Fig. 6. $n_T = 4$, $n_R = 4$, $B = 6$, SNR $= 15$ dB, $\mu = 0.01$, $\alpha(\%) = 99$:
ZF precoding MSE based on Subsection IV-D for the MMSE estimator with
$\mathbf{R}_R = \mathbf{S}_R$. $\mathcal{I}_{\mathrm{adm}}$ is based on a previous channel estimate.

Fig. 7. $n_T = 6$, $n_R = 6$, $B = 8$, $\gamma = 1$: Outage probability for the
L-optimality criterion with the MVU estimator. The accuracy parameter is
$\gamma = 1$. (Curves: MVU, ADGPP, MVU in Subsection IV-D; horizontal axis: Training Energy (dB).)

Fig. 8. $n_T = 4$, $n_R = 4$, $B = 6$, $\gamma = -10$ dB, $\mu = 0.01$, $\alpha(\%) = 99$:
BER performance using the signal estimates produced by the corresponding
schemes in Fig. 6 with $\mathbf{R}_R = \mathbf{S}_R$ and $\gamma = -10$ dB. $\mathcal{I}_{\mathrm{adm}}$ is based on a
previous channel estimate. (Horizontal axis: SNR (dB).)

Fig. 9. $n_T = 4$, $n_R = 4$, $B = 6$, $\gamma = 5$ dB, $\mu = 0.01$, $\alpha(\%) = 99$:
BER performance using the signal estimates produced by the corresponding
schemes in Fig. 6 with $\mathbf{R}_R = \mathbf{S}_R$ and $\gamma = 5$ dB. $\mathcal{I}_{\mathrm{adm}}$ is based on a previous
channel estimate.



correlation scenario for all the presented schemes. Therefore,
in all plots $|r| = 0.9$ for all matrices $\mathbf{R}_T$, $\mathbf{R}_R$, $\mathbf{S}_Q$, $\mathbf{S}_R$.
Additionally, the transmit SNR during data transmission is
chosen to be 15 dB when channel equalization and ZF
precoding are considered. High SNR expressions are therefore
used for optimal training sequence designs. Since the optimal
pilot sequences depend on the true channel, we have for
these two applications additionally assumed that the channel
changes from block to block according to the relationship
$\mathbf{H}_i = \mathbf{H}_{i-1} + \mu \mathbf{E}_i$, where $\mathbf{E}_i$ has the same Kronecker
structure as $\mathbf{H}$ and is completely independent of $\mathbf{H}_{i-1}$.
The estimated $\hat{\mathbf{H}}_{i-1}$ is used in the pilot design. In Figs. 4-6,
8 and 9, the value of $\mu$ is 0.01.
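The exponential model (36) specifies only the entries with $j \geq i$. A minimal sketch of how such a matrix could be built is given below, assuming that the lower triangle is filled by Hermitian symmetry (an assumption made here so that the result is a valid covariance matrix); the function name and the phase of $r$ are hypothetical.

```python
import numpy as np

def exponential_correlation(n, r):
    """Build an n x n matrix with (R)_{i,j} = r^(j-i) for j >= i.

    The entries below the diagonal are set to the complex conjugates of their
    mirrored counterparts, which is an assumption (Hermitian completion).
    """
    idx = np.arange(n)
    d = idx[None, :] - idx[:, None]          # d[i, j] = j - i
    upper = np.power(complex(r), np.abs(d))  # r^|j-i|
    return np.where(d >= 0, upper, np.conj(upper))

# High-correlation setting |r| = 0.9 with a hypothetical phase of 0.3 rad.
r = 0.9 * np.exp(1j * 0.3)
R_T = exponential_correlation(4, r)
print(np.round(np.abs(R_T), 3))
```

With this completion the matrix is Hermitian with unit diagonal, and it is positive definite whenever $|r| < 1$.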

In Fig. 1 the channel estimation NMSE performance versus
the accuracy $\gamma$ is presented for three different schemes. The
scheme 'ASGPP' is the optimal Wiener filter together with
the optimal guaranteed performance training matrix described
in Subsection IV-A. 'Optimal MMSE in [9]' is the scheme
presented in [9], which solves the optimal training problem
for the vectorized MMSE, operating on $\operatorname{vec}(\mathbf{Y})$. This solution
is a special case in the statement of Theorem 5 for $\mathcal{I}_{\mathrm{adm}} = \mathbf{I}$,
i.e., $\mathcal{I}_T = \mathbf{I}$ and $\mathcal{I}_R = \mathbf{I}$. Finally, the scheme 'White
training' corresponds to the use of the vectorized MMSE filter
at the receiver, with a white training matrix, i.e., one having
equal singular values and arbitrary left and right singular
matrices. This scheme is justified when the receiver knows
the involved channel and noise statistics, but does not want
to sacrifice bandwidth to feed back the optimal training matrix
to the transmitter. This scheme is also justified in fast fading
environments. In Fig. 1 we assume that $\mathbf{R}_R = \mathbf{S}_R$ and we
implement the corresponding optimal training design for each
scheme. 'ASGPP' is implemented first for a certain value of
$\gamma$ and the rest of the schemes are forced to have the same
training energy. The 'Optimal MMSE in [9]' and 'ASGPP'
schemes have the best and almost identical MSE performance.
This indicates that for the problem of training design with
the classical channel estimation MSE, the confidence ellipsoid
relaxation of the chance constraint and the relaxation based on
the Markov bound in Subsection IV-D deliver almost identical
performances.
Figs. 2 and 3 demonstrate the L-optimality average performance
metric $E\{J_W\}$ versus $\gamma$. Fig. 2 corresponds to the
L-optimality criterion based on MVU estimators and Fig. 3 is
based on MMSE estimators. In Fig. 2 the scheme 'MVU'
corresponds to the optimal training for channel estimation
when the MVU estimator is used. This training is given by
Theorem 4 for $\mathcal{I}_{\mathrm{adm}} = \mathbf{I}$, i.e., $\mathcal{I}_T = \mathbf{I}$ and $\mathcal{I}_R = \mathbf{I}$.
'MVU in Subsection IV-D' is again the MVU estimator based
on the same theorem but for the correct $\mathcal{I}_{\mathrm{adm}}$. The scheme
'MMSE in Subsection IV-D' is given by the numerical solution
mentioned below Theorem 5, since $\mathbf{W}_1$ is different from the
cases where a closed form solution is possible. Figs. 2 and
3 clearly show that both the confidence ellipsoid and Markov
bound approximations are better than the optimal training for
standard channel estimation. Therefore, for this problem the
application-oriented training design is superior to
training designs with respect to the quality of the channel
estimate.

Fig. 4 demonstrates the performance of optimal training
designs for the MMSE estimator in the context of MMSE
channel equalization. We assume that $\mathbf{R}_R \neq \mathbf{S}_R$, since the
high SNR expressions for $\mathcal{I}_{\mathrm{adm}}$ in the context of MMSE channel
equalization in Appendix I indicate that $\mathcal{I}_T = \mathbf{I}$ for this
application and according to Theorem 5 the optimal training
corresponds to the optimal training for channel estimation in
[8]. We observe that the curves almost coincide. Moreover,
it can be easily verified that for MMSE channel equalization
with the MVU estimator, the optimal training designs given
by Theorems 2 and 4 differ slightly only in the optimal power
loading. These observations essentially show that the optimal
training designs for the MVU and MMSE estimators in the
classical channel estimation setup are nearly optimal for the
application of MMSE channel equalization. This relies on the
fact that for this particular application, $\mathcal{I}_T = \mathbf{I}$ in the high
data transmission SNR regime.

Figs. 5 and 6 present the corresponding performances in
the case of the ZF precoding. The descriptions of the schemes
are as before. In Fig. 6 we assume that $\mathbf{R}_R = \mathbf{S}_R$. The
superiority of the application-oriented designs for the ZF
precoding application is apparent in these plots. Here, $\mathcal{I}_T \neq \mathbf{I}$
and this is why the optimal training for the channel estimate
works less well in this application. Moreover, the 'ASGPP'
is plotted for $\gamma \geq 0$ dB, since for smaller values of $\gamma$ all the
eigenvalues of $\mathbf{B} = [c\,\lambda_{\max}(\mathbf{S}_R \mathcal{I}_R)\, \mathcal{I}_T - \mathbf{R}_T^{-1}]_+$ are equal to
zero for this particular set of parameters defining Fig. 6.

Fig. 7 presents an outage plot in the context of the
L-optimality criterion for the MVU estimator. We assume that
$\gamma = 1$. We plot $\Pr\{J_W > 1/\gamma\}$ versus the training power.
This plot indirectly verifies that the confidence ellipsoid relaxation
of the chance constraint given by the scheme 'ADGPP'
is not as tight as the Markov bound approximation given by
the scheme 'MVU in Subsection IV-D'.

Finally, Figs. 8 and 9 present the BER performance of the
nearest neighbor rule applied to the signal estimates produced
by the corresponding schemes in Fig. 6 when QPSK
modulation is used. The 'Clairvoyant' scheme corresponds
to the ZF precoder with perfect channel knowledge. The
channel estimates have been obtained for $\gamma = -10$ and $5$ dB,
respectively. Even if the application-oriented estimates are
not optimized for the BER performance metric, they lead to
better performance than the 'Optimal MMSE in [9]' scheme,
as is apparent in Fig. 8. In Fig. 9 the performances of all
schemes approximately coincide. This is due to the fact that
for $\gamma = 5$ dB all channel estimates are very good, thus leading
to symbol MSE performance differences that have negligible
impact on the BER performance.

VII. Conclusions 

In this contribution, we have presented a quite general
framework for MIMO training sequence design subject to
flat and block fading, as well as spatially and temporally
correlated Gaussian noise. The main contribution has been to
incorporate the objective of the channel estimation into the
design. We have shown that by a suitable approximation of
$J(\tilde{\mathbf{H}}, \mathbf{H})$, it is possible to solve this type of problem for several
interesting applications such as standard MIMO channel estimation,
the L-optimality criterion, MMSE channel equalization
and ZF precoding. For these problems, we have numerically
demonstrated the superiority of the schemes derived in this
paper. Additionally, the proposed framework is valuable since
it provides a universal way of posing different estimation-related
problems in communication systems. We have seen
that it shows interesting promise for, e.g., ZF precoding and
it may yield even greater end performance gains in estimation
problems related to communication systems, when approximations
can be avoided, depending on the end performance
metric at hand.

Appendix I 

Approximating the performance measure for 
MMSE Channel Equalization 

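In order to obtain the approximating set $\mathcal{D}_{\mathrm{adm}}$, let us first
denote the integrand in the performance metric (33) by

$$J'(\omega; \tilde{\mathbf{H}}, \mathbf{H}) = \operatorname{tr}\left( \mathbf{\Delta}(e^{j\omega}; \tilde{\mathbf{H}}, \mathbf{H})\, \mathbf{\Phi}_y(\omega)\, \mathbf{\Delta}^H(e^{j\omega}; \tilde{\mathbf{H}}, \mathbf{H}) \right). \tag{37}$$

In addition, let $\approx$ denote an equality in which only dominating
terms with respect to $\|\tilde{\mathbf{H}}\|$ are retained. Then, using (32), we
observe that

$$\begin{aligned}
\mathbf{\Delta}(e^{j\omega}; \tilde{\mathbf{H}}, \mathbf{H}) &= \mathbf{F}(e^{j\omega}; \mathbf{H} + \tilde{\mathbf{H}}) - \mathbf{F}(e^{j\omega}; \mathbf{H}) \\
&\approx \lambda_x \tilde{\mathbf{H}}^H \mathbf{\Phi}_y^{-1} - \lambda_x^2 \mathbf{H}^H \mathbf{\Phi}_y^{-1} \left( \tilde{\mathbf{H}} \mathbf{H}^H + \mathbf{H} \tilde{\mathbf{H}}^H \right) \mathbf{\Phi}_y^{-1} \\
&= \lambda_x \Big( \underbrace{\left( \mathbf{I} - \lambda_x \mathbf{H}^H \mathbf{\Phi}_y^{-1} \mathbf{H} \right)}_{\triangleq\, \mathbf{Q}} \tilde{\mathbf{H}}^H \mathbf{\Phi}_y^{-1} - \lambda_x \mathbf{H}^H \mathbf{\Phi}_y^{-1} \tilde{\mathbf{H}} \mathbf{H}^H \mathbf{\Phi}_y^{-1} \Big)
\end{aligned} \tag{38}$$

where we omitted the argument $\omega$ for simplicity. Inserting (38)
in (37) results in the approximation

$$\begin{aligned}
J'(\omega; \tilde{\mathbf{H}}, \mathbf{H}) \approx \operatorname{tr}\Big( & \lambda_x^2\, \mathbf{Q} \tilde{\mathbf{H}}^H \mathbf{\Phi}_y^{-1} \tilde{\mathbf{H}} \mathbf{Q}
+ \lambda_x^4\, \mathbf{H}^H \mathbf{\Phi}_y^{-1} \tilde{\mathbf{H}} \mathbf{H}^H \mathbf{\Phi}_y^{-1} \mathbf{H} \tilde{\mathbf{H}}^H \mathbf{\Phi}_y^{-1} \mathbf{H} \\
& - \lambda_x^3\, \mathbf{Q} \tilde{\mathbf{H}}^H \mathbf{\Phi}_y^{-1} \mathbf{H} \tilde{\mathbf{H}}^H \mathbf{\Phi}_y^{-1} \mathbf{H}
- \lambda_x^3\, \mathbf{H}^H \mathbf{\Phi}_y^{-1} \tilde{\mathbf{H}} \mathbf{H}^H \mathbf{\Phi}_y^{-1} \tilde{\mathbf{H}} \mathbf{Q} \Big).
\end{aligned} \tag{39}$$

To rewrite this into a quadratic form in terms of $\operatorname{vec}(\tilde{\mathbf{H}})$ we
use the facts that $\operatorname{tr}(\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{A}) = \operatorname{vec}^T(\mathbf{A}^T) \operatorname{vec}(\mathbf{B}) =
\operatorname{vec}^H(\mathbf{A}^H) \operatorname{vec}(\mathbf{B})$ and $\operatorname{vec}(\mathbf{A}\mathbf{B}\mathbf{C}) = (\mathbf{C}^T \otimes \mathbf{A}) \operatorname{vec}(\mathbf{B})$ for
matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ of compatible dimensions. Hence, we
can rewrite (39) as

$$\begin{aligned}
J'(\omega; \tilde{\mathbf{H}}, \mathbf{H}) \approx{} & \operatorname{vec}^H(\tilde{\mathbf{H}}) \left[ \lambda_x^2\, (\mathbf{Q}^2)^T \otimes \mathbf{\Phi}_y^{-1} \right] \operatorname{vec}(\tilde{\mathbf{H}}) \\
& + \operatorname{vec}^H(\tilde{\mathbf{H}}) \left[ \lambda_x^4 \left( \mathbf{H}^H \mathbf{\Phi}_y^{-1} \mathbf{H} \right)^T \otimes \mathbf{\Phi}_y^{-1} \mathbf{H} \mathbf{H}^H \mathbf{\Phi}_y^{-1} \right] \operatorname{vec}(\tilde{\mathbf{H}}) \\
& - \operatorname{vec}^H(\tilde{\mathbf{H}}) \left[ \lambda_x^3 \left( \mathbf{\Phi}_y^{-1} \mathbf{H} \mathbf{Q} \right)^T \otimes \mathbf{\Phi}_y^{-1} \mathbf{H} \right] \operatorname{vec}(\tilde{\mathbf{H}}^H) \\
& - \operatorname{vec}^H(\tilde{\mathbf{H}}^H) \left[ \lambda_x^3 \left( \mathbf{Q} \mathbf{H}^H \mathbf{\Phi}_y^{-1} \right)^T \otimes \mathbf{H}^H \mathbf{\Phi}_y^{-1} \right] \operatorname{vec}(\tilde{\mathbf{H}}).
\end{aligned} \tag{40}$$

In the next step, we introduce the permutation matrix $\mathbf{\Pi}$
defined such that $\operatorname{vec}(\tilde{\mathbf{H}}^T) = \mathbf{\Pi} \operatorname{vec}(\tilde{\mathbf{H}})$ for every $\tilde{\mathbf{H}}$ to rewrite
(40) as

$$\begin{aligned}
J'(\omega; \tilde{\mathbf{H}}, \mathbf{H}) \approx{} & \operatorname{vec}^H(\tilde{\mathbf{H}}) \left[ \lambda_x^2\, (\mathbf{Q}^2)^T \otimes \mathbf{\Phi}_y^{-1} \right] \operatorname{vec}(\tilde{\mathbf{H}}) \\
& + \operatorname{vec}^H(\tilde{\mathbf{H}}) \left[ \lambda_x^4 \left( \mathbf{H}^H \mathbf{\Phi}_y^{-1} \mathbf{H} \right)^T \otimes \mathbf{\Phi}_y^{-1} \mathbf{H} \mathbf{H}^H \mathbf{\Phi}_y^{-1} \right] \operatorname{vec}(\tilde{\mathbf{H}}) \\
& - \operatorname{vec}^H(\tilde{\mathbf{H}}) \left[ \lambda_x^3 \left( \mathbf{\Phi}_y^{-1} \mathbf{H} \mathbf{Q} \right)^T \otimes \mathbf{\Phi}_y^{-1} \mathbf{H} \right] \mathbf{\Pi} \operatorname{vec}(\tilde{\mathbf{H}}^*) \\
& - \operatorname{vec}^H(\tilde{\mathbf{H}}^*)\, \mathbf{\Pi}^T \left[ \lambda_x^3 \left( \mathbf{Q} \mathbf{H}^H \mathbf{\Phi}_y^{-1} \right)^T \otimes \mathbf{H}^H \mathbf{\Phi}_y^{-1} \right] \operatorname{vec}(\tilde{\mathbf{H}}).
\end{aligned} \tag{41}$$

We have now obtained a quadratic form. Note indeed that the
last two terms are just complex conjugates of each other and
thus we can write them as two times their real part.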
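The vectorization identities used to pass from (39) to (41) are easy to confirm numerically. The following sketch checks $\operatorname{tr}(\mathbf{A}\mathbf{B}) = \operatorname{vec}^H(\mathbf{A}^H)\operatorname{vec}(\mathbf{B})$ and $\operatorname{vec}(\mathbf{A}\mathbf{B}\mathbf{C}) = (\mathbf{C}^T \otimes \mathbf{A})\operatorname{vec}(\mathbf{B})$, and builds a permutation matrix with $\operatorname{vec}(\mathbf{X}^T) = \mathbf{\Pi}\operatorname{vec}(\mathbf{X})$. The helper names are introduced here for illustration, and column-major vectorization is assumed.

```python
import numpy as np

def vec(M):
    """Column-major vectorization."""
    return np.asarray(M).reshape(-1, order="F")

def commutation_matrix(m, n):
    """Permutation matrix K with K @ vec(X) = vec(X.T) for any X of shape (m, n)."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # X[i, j] sits at index j*m + i in vec(X) and at index i*n + j in vec(X.T).
            K[i * n + j, j * m + i] = 1.0
    return K

rng = np.random.default_rng(0)
def crandn(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

A, B, C = crandn(3, 2), crandn(2, 4), crandn(4, 3)
X = crandn(2, 4)

lhs_trace = np.trace(A @ B @ C)                        # tr(A (B C))
rhs_trace = np.vdot(vec(A.conj().T), vec(B @ C))       # vec^H(A^H) vec(B C)
lhs_vec = vec(A @ B @ C)
rhs_vec = np.kron(C.T, A) @ vec(B)
Pi = commutation_matrix(*X.shape)
print(np.isclose(lhs_trace, rhs_trace), np.allclose(lhs_vec, rhs_vec),
      np.allclose(Pi @ vec(X), vec(X.T)))
```

Since `commutation_matrix` returns a permutation matrix, its transpose is its inverse, which is what allows the last two terms of (41) to be written as conjugates of each other.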

A. High SNR analysis 

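In order to obtain a simpler expression for $\mathcal{I}_{\mathrm{adm}}$, we will
assume high SNR in the data transmission phase. We consider
the practically relevant case where $\operatorname{rank}(\mathbf{H}) = \min(n_T, n_R)$.
Depending on the rank of the channel matrix $\mathbf{H}$ we will have
three different cases:

Case 1: $\operatorname{rank}(\mathbf{H}) = n_R < n_T$: Under this assumption, it
can be shown that both the first and the second terms on the
right hand side of (41) contribute to $\mathcal{I}_{\mathrm{adm}}$. We have $\mathbf{Q} \to \mathbf{\Pi}^{\perp}_{\mathbf{H}^H}$
and $\lambda_x \mathbf{\Phi}_y^{-1} \to (\mathbf{H}\mathbf{H}^H)^{-1}$ for high SNR. Here, and in
what follows, we use $\mathbf{\Pi}_{\mathbf{X}} = \mathbf{X}\mathbf{X}^\dagger$ to denote the orthogonal
projection matrix on the range-space of $\mathbf{X}$ and $\mathbf{\Pi}^{\perp}_{\mathbf{X}} = \mathbf{I} - \mathbf{\Pi}_{\mathbf{X}}$
to denote the projection on the nullspace of $\mathbf{X}^H$. Moreover,
$\lambda_x \mathbf{H}^H \mathbf{\Phi}_y^{-1} \mathbf{H} \to \mathbf{\Pi}_{\mathbf{H}^H}$ and $\lambda_x^2 \mathbf{\Phi}_y^{-1} \mathbf{H}\mathbf{H}^H \mathbf{\Phi}_y^{-1} \to (\mathbf{H}\mathbf{H}^H)^{-1}$
for high SNR. As $\mathbf{\Pi}^{\perp}_{\mathbf{H}^H} + \mathbf{\Pi}_{\mathbf{H}^H} = \mathbf{I}$, summing the contributions
from the first two terms in (41) finally gives the high
SNR approximation

$$\mathcal{I}_{\mathrm{adm}} = \lambda_x\, \mathbf{I} \otimes (\mathbf{H}\mathbf{H}^H)^{-1}. \tag{42}$$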
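The limit in (42) can be examined numerically by summing the first two weighting matrices of (41) at a very small noise level and comparing with $\lambda_x \mathbf{I} \otimes (\mathbf{H}\mathbf{H}^H)^{-1}$. This is a rough sketch under assumptions made here: a white noise spectrum $\mathbf{\Phi}_n = \sigma^2 \mathbf{I}$, a random channel with $n_R < n_T$, and hypothetical function names.

```python
import numpy as np

def first_two_terms(H, Phi_n, lam_x):
    """Sum of the first two weighting matrices in (41) at one frequency."""
    n_T = H.shape[1]
    Hh = H.conj().T
    Phi_y_inv = np.linalg.inv(lam_x * H @ Hh + Phi_n)
    Q = np.eye(n_T) - lam_x * Hh @ Phi_y_inv @ H
    term1 = lam_x**2 * np.kron((Q @ Q).T, Phi_y_inv)
    term2 = lam_x**4 * np.kron((Hh @ Phi_y_inv @ H).T,
                               Phi_y_inv @ H @ Hh @ Phi_y_inv)
    return term1 + term2

def high_snr_limit(H, lam_x):
    """lam_x * (I kron (H H^H)^{-1}), the claimed high-SNR expression (42)."""
    n_T = H.shape[1]
    return lam_x * np.kron(np.eye(n_T), np.linalg.inv(H @ H.conj().T))

rng = np.random.default_rng(0)
n_R, n_T, lam_x, sigma2 = 2, 4, 1.0, 1e-8   # assumed values; tiny noise means high SNR
H = rng.standard_normal((n_R, n_T)) + 1j * rng.standard_normal((n_R, n_T))
approx = first_two_terms(H, sigma2 * np.eye(n_R), lam_x)
limit = high_snr_limit(H, lam_x)
rel_err = np.linalg.norm(approx - limit) / np.linalg.norm(limit)
print(rel_err)
```

The relative error shrinks with the noise level, which is consistent with the projection argument above.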



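Case 2: $\operatorname{rank}(\mathbf{H}) = n_R = n_T$: For the non-singular
channel case, the second term on the right hand side of
(41) dominates. Here, we have $\lambda_x \mathbf{H}^H \mathbf{\Phi}_y^{-1} \mathbf{H} \to \mathbf{I}$ and
$\lambda_x^2 \mathbf{\Phi}_y^{-1} \mathbf{H}\mathbf{H}^H \mathbf{\Phi}_y^{-1} \to (\mathbf{H}\mathbf{H}^H)^{-1}$ for high SNR. Clearly, this
results in the same expression for $\mathcal{I}_{\mathrm{adm}}$ as in Case 1, namely,

$$\mathcal{I}_{\mathrm{adm}} = \lambda_x\, \mathbf{I} \otimes (\mathbf{H}\mathbf{H}^H)^{-1}. \tag{43}$$

Case 3: $\operatorname{rank}(\mathbf{H}) = n_T < n_R$: In this case, the second term
on the right hand side of (41) dominates. When $\operatorname{rank}(\mathbf{H}) =
n_T$ we get $\lambda_x \mathbf{H}^H \mathbf{\Phi}_y^{-1} \mathbf{H} \to \mathbf{I}$ and
$\lambda_x^2 \mathbf{\Phi}_y^{-1} \mathbf{H}\mathbf{H}^H \mathbf{\Phi}_y^{-1} \to \mathbf{\Phi}_n^{-1/2} \left[ \mathbf{\Phi}_n^{-1/2} \mathbf{H}\mathbf{H}^H \mathbf{\Phi}_n^{-1/2} \right]^\dagger \mathbf{\Phi}_n^{-1/2}$ for high SNR. Using these
approximations finally gives the high SNR approximation

$$\mathcal{I}_{\mathrm{adm}} = \lambda_x\, \mathbf{I} \otimes \frac{1}{2\pi} \int_{-\pi}^{\pi} \mathbf{\Phi}_n^{-1/2}(\omega) \left[ \mathbf{\Phi}_n^{-1/2}(\omega)\, \mathbf{H}\mathbf{H}^H\, \mathbf{\Phi}_n^{-1/2}(\omega) \right]^\dagger \mathbf{\Phi}_n^{-1/2}(\omega)\, d\omega.$$
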

B. Low SNR analysis 

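For the low SNR regime, we do not need to differentiate
our analysis for the cases $n_T \geq n_R$ and $n_T < n_R$, because
now $\mathbf{\Phi}_y \to \mathbf{\Phi}_n$. It can be shown that the first term on the
right hand side of (41) dominates; that is, the term involving
$\lambda_x^2 \left( (\mathbf{Q}^2)^T \otimes \mathbf{\Phi}_y^{-1} \right)$.
Moreover, $\mathbf{Q} \to \mathbf{I}$ and $\mathbf{\Phi}_y^{-1} \to \mathbf{\Phi}_n^{-1}$. This yields

$$\mathcal{I}_{\mathrm{adm}} = \lambda_x^2\, \mathbf{I} \otimes \frac{1}{2\pi} \int_{-\pi}^{\pi} \mathbf{\Phi}_n^{-1}(\omega)\, d\omega. \tag{44}$$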



Appendix II 
Proof of Theorem 1

For the proof of Theorem 1 we require some preliminary
results. Lemma 1 and Lemma 2 will be used to establish the
uniqueness part of Theorem 1, and Lemma 3 is an extension
of a standard result in majorization theory, which is used in
the main part of the proof.

Lemma 1: Let $\mathbf{D} \in \mathbb{R}^{n \times n}$ be a diagonal matrix with
elements $d_{1,1} > \cdots > d_{n,n} > 0$. If $\mathbf{U} \in \mathbb{C}^{n \times n}$ is a unitary
matrix such that $\mathbf{U}\mathbf{D}\mathbf{U}^H$ has diagonal $(d_{1,1}, \ldots, d_{n,n})$, then
$\mathbf{U}$ is of the form $\mathbf{U} = \operatorname{diag}(u_{1,1}, \ldots, u_{n,n})$, where $|u_{i,i}| = 1$
for $i = 1, \ldots, n$. This also implies that $\mathbf{U}\mathbf{D}\mathbf{U}^H = \mathbf{D}$.

Proof: Let $\mathbf{V} = \mathbf{U}\mathbf{D}\mathbf{U}^H$. The equation for $(\mathbf{V})_{i,i}$ is

$$d_{i,i} = \sum_{k=1}^{n} d_{k,k} |u_{i,k}|^2,$$

from which we have, since the rows and columns of the unitary
matrix $\mathbf{U}$ are orthonormal, that

$$\sum_{k=1}^{n} \frac{d_{k,k}}{d_{i,i}} |u_{i,k}|^2 = 1 = \sum_{k=1}^{n} |u_{i,k}|^2. \tag{45}$$

We now proceed by induction on $i = 1, \ldots, n$ to show that
the $i$th column of $\mathbf{U}$ is $[0 \cdots u_{i,i} \cdots 0]^T$ with $|u_{i,i}| = 1$.
For $i = 1$, it follows from (45) and the fact that $\mathbf{U}$ is unitary
that

$$|u_{1,1}|^2 + \frac{d_{2,2}}{d_{1,1}} |u_{1,2}|^2 + \cdots + \frac{d_{n,n}}{d_{1,1}} |u_{1,n}|^2 = |u_{1,1}|^2 + |u_{1,2}|^2 + \cdots + |u_{1,n}|^2 = 1.$$

However, since $d_{1,1} > \cdots > d_{n,n} > 0$, the only way to
satisfy this equation is to have $|u_{1,1}| = 1$ and $u_{1,i} = 0$ for
$i = 2, \ldots, n$; as the first column of $\mathbf{U}$ has unit norm, this
also gives $u_{i,1} = 0$ for $i = 2, \ldots, n$. Now, if the assertion holds for $i = 1, \ldots, k$, the
orthogonality of the columns of $\mathbf{U}$ implies that $u_{i,k+1} = 0$ for
$i = 1, \ldots, k$, and by following a similar reasoning as for the
case $i = 1$ we deduce that $|u_{k+1,k+1}| = 1$ and $u_{i,k+1} = 0$ for
$i = k + 2, \ldots, n$. ■

Lemma 2: Let $\mathbf{D} \in \mathbb{R}^{N \times N}$ be a diagonal matrix with
elements $d_{1,1} > \cdots > d_{N,N} > 0$. If $\mathbf{U} \in \mathbb{C}^{N \times n}$, with
$n \leq N$, is such that $\mathbf{U}^H \mathbf{U} = \mathbf{I}$ and $\mathbf{V} = \mathbf{D}\mathbf{U}\tilde{\mathbf{D}}^{-1}$ (where
$\tilde{\mathbf{D}} = \operatorname{diag}(d_{1,1}, \ldots, d_{n,n})$) also satisfies $\mathbf{V}^H \mathbf{V} = \mathbf{I}$, then $\mathbf{U}$
is of the form $\mathbf{U} = [\operatorname{diag}(u_{1,1}, \ldots, u_{n,n}) \;\; \mathbf{0}_{N-n,n}^T]^T$, where
$|u_{i,i}| = 1$ for $i = 1, \ldots, n$.

Proof: The idea is similar to the proof of Lemma 1. We
proceed by induction on the $i$th column of $\mathbf{V}$. For the first
column of $\mathbf{V}$ we have, by the orthonormality of the columns
of $\mathbf{U}$ and $\mathbf{V}$, that

$$|u_{1,1}|^2 + \left( \frac{d_{2,2}}{d_{1,1}} \right)^2 |u_{2,1}|^2 + \cdots + \left( \frac{d_{N,N}}{d_{1,1}} \right)^2 |u_{N,1}|^2 = |u_{1,1}|^2 + \cdots + |u_{N,1}|^2 = 1.$$

Since $d_{1,1} > \cdots > d_{N,N} > 0$, the only way to satisfy this
equation is to have $|u_{1,1}| = 1$ and $u_{i,1} = 0$ for $i = 2, \ldots, N$. If
now the assertion holds for columns $1$ to $k$, the orthogonality
of the columns of $\mathbf{U}$ implies that $u_{i,k+1} = 0$ for $i = 1, \ldots, k$,
and by following a similar reasoning as for the first column
of $\mathbf{U}$ we have that $|u_{k+1,k+1}| = 1$ and $u_{i,k+1} = 0$ for $i =
k + 2, \ldots, N$. ■

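Lemma 3: Let $\mathbf{A}, \mathbf{B} \in \mathbb{C}^{n \times n}$ be Hermitian matrices.
Arrange the eigenvalues $a_1, \ldots, a_n$ of $\mathbf{A}$ in a descending
order, and the eigenvalues $b_1, \ldots, b_n$ of $\mathbf{B}$ in an ascending
order. Then $\operatorname{tr}(\mathbf{A}\mathbf{B}) \geq \sum_{i=1}^{n} a_i b_i$. Furthermore,
if $\mathbf{B} = \operatorname{diag}(b_1, \ldots, b_n)$ and both matrices have distinct
eigenvalues, then $\operatorname{tr}(\mathbf{A}\mathbf{B}) = \sum_{i=1}^{n} a_i b_i$ if and only if $\mathbf{A} =
\operatorname{diag}(a_1, \ldots, a_n)$.

Proof: See [34, Theorem 9.H.1.h] for the proof of
the first assertion. For the second part, notice that if $\mathbf{B} =
\operatorname{diag}(b_1, \ldots, b_n)$, then by [34, Theorem 6.A.3]

$$\operatorname{tr}(\mathbf{A}\mathbf{B}) = \sum_{i=1}^{n} (\mathbf{A})_{i,i}\, b_i \geq \sum_{i=1}^{n} (\mathbf{A})_{[i,i]}\, b_i,$$

where $\{(\mathbf{A})_{[i,i]}\}_{i=1,\ldots,n}$ denotes the ordered set
$\{(\mathbf{A})_{1,1}, \ldots, (\mathbf{A})_{n,n}\}$ sorted in descending order. Since
$\{(\mathbf{A})_{[i,i]}\}_{i=1,\ldots,n}$ is majorized by $\{a_1, \ldots, a_n\}$, and the $b_i$'s
are distinct, we can use [34, Theorem 3.A.2] to show that

$$\sum_{i=1}^{n} (\mathbf{A})_{[i,i]}\, b_i > \sum_{i=1}^{n} a_i b_i$$

unless $(\mathbf{A})_{[i,i]} = a_i$ for every $i = 1, \ldots, n$. Therefore,
$\operatorname{tr}(\mathbf{A}\mathbf{B}) = \sum_{i=1}^{n} a_i b_i$ if and only if the diagonal of $\mathbf{A}$
is $(a_1, \ldots, a_n)$. Now we have to prove that $\mathbf{A}$ is actually
diagonal, but this follows from Lemma 1. ■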
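The first assertion of Lemma 3 lends itself to a quick numerical sanity check. The sketch below draws random Hermitian matrices and compares $\operatorname{tr}(\mathbf{A}\mathbf{B})$ with the oppositely ordered eigenvalue pairing; the function name and the random test matrices are assumptions made for illustration.

```python
import numpy as np

def trace_lower_bound(A, B):
    """Sum of a_i * b_i with eigenvalues of A descending and those of B ascending."""
    a = np.sort(np.linalg.eigvalsh(A))[::-1]
    b = np.sort(np.linalg.eigvalsh(B))
    return float(a @ b)

def random_hermitian(n, rng):
    X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (X + X.conj().T) / 2

rng = np.random.default_rng(0)
gaps = []
for _ in range(200):
    A, B = random_hermitian(4, rng), random_hermitian(4, rng)
    gaps.append(np.real(np.trace(A @ B)) - trace_lower_bound(A, B))
print(min(gaps))  # smallest observed value of tr(AB) minus the bound
```

For diagonal matrices with oppositely sorted entries the bound is attained, which is the situation exploited in the proof of Theorem 1.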

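Proof of Theorem 1: First, we simplify the expressions in
(21). Using the eigendecompositions in (23) of $\mathbf{A}$ and $\mathbf{B}$, we
see that

$$\begin{aligned}
\mathbf{P}\mathbf{A}^{-1}\mathbf{P}^H \succeq \mathbf{B} \;&\Leftrightarrow\; \mathbf{P}\mathbf{U}_A \mathbf{D}_A^{-1} \mathbf{U}_A^H \mathbf{P}^H \succeq \mathbf{U}_B \mathbf{D}_B \mathbf{U}_B^H \\
&\Leftrightarrow\; \mathbf{U}_B^H \mathbf{P}\mathbf{U}_A \mathbf{D}_A^{-1} \mathbf{U}_A^H \mathbf{P}^H \mathbf{U}_B \succeq \mathbf{D}_B.
\end{aligned}$$

Now, define $\bar{\mathbf{P}} = \mathbf{U}_B^H \mathbf{P} \mathbf{U}_A \mathbf{D}_A^{-1/2}$ and observe that

$$\begin{aligned}
\operatorname{tr}(\mathbf{P}\mathbf{P}^H) &= \operatorname{tr}\left[ (\mathbf{U}_B \bar{\mathbf{P}} \mathbf{D}_A^{1/2} \mathbf{U}_A^H)(\mathbf{U}_B \bar{\mathbf{P}} \mathbf{D}_A^{1/2} \mathbf{U}_A^H)^H \right] \\
&= \operatorname{tr}(\mathbf{U}_B \bar{\mathbf{P}} \mathbf{D}_A \bar{\mathbf{P}}^H \mathbf{U}_B^H) = \operatorname{tr}(\bar{\mathbf{P}}^H \bar{\mathbf{P}} \mathbf{D}_A).
\end{aligned}$$

Therefore, (21) is equivalent to

$$\begin{aligned}
\underset{\bar{\mathbf{P}} \in \mathbb{C}^{n \times N}}{\text{minimize}} \quad & \operatorname{tr}(\bar{\mathbf{P}}^H \bar{\mathbf{P}} \mathbf{D}_A) \\
\text{s.t.} \quad & \bar{\mathbf{P}}\bar{\mathbf{P}}^H \succeq \mathbf{D}_B.
\end{aligned} \tag{46}$$

To further simplify our problem, consider the singular value
decomposition $\bar{\mathbf{P}} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^H$, where $\mathbf{U} \in \mathbb{C}^{n \times n}$ and $\mathbf{V} \in \mathbb{C}^{N \times N}$
are unitary matrices and $\mathbf{\Sigma}$ has the structure

$$\mathbf{\Sigma} = \begin{bmatrix} \operatorname{diag}(\sigma_1, \ldots, \sigma_m) & \mathbf{0} \end{bmatrix} \quad \text{or} \quad \mathbf{\Sigma} = \begin{bmatrix} \operatorname{diag}(\sigma_1, \ldots, \sigma_m) \\ \mathbf{0} \end{bmatrix}$$

depending on whether $N \geq n$ or $N < n$. The singular values
are ordered such that $\sigma_1 \geq \cdots \geq \sigma_m \geq 0$. Now, observe that
(46) is equivalent to

$$\begin{aligned}
\text{minimize} \quad & \operatorname{tr}(\mathbf{V}\mathbf{\Sigma}^H \mathbf{\Sigma} \mathbf{V}^H \mathbf{D}_A) \\
\text{s.t.} \quad & \mathbf{U}\mathbf{\Sigma}\mathbf{\Sigma}^H \mathbf{U}^H \succeq \mathbf{D}_B.
\end{aligned} \tag{47}$$

With this problem formulation, it follows (from Sylvester's
law of inertia [35]) that we need $m \geq \operatorname{rank}(\mathbf{D}_B)$ to achieve
feasibility in the constraint (i.e., having at least as many non-zero
singular values of $\mathbf{\Sigma}$ as non-zero eigenvalues in $\mathbf{D}_B$). This
corresponds to the condition $N \geq \operatorname{rank}(\mathbf{B})$ in the theorem.

Now we will show that $\mathbf{U}$ and $\mathbf{V}$ can be taken to be the
identity matrices. Using Lemma 3, the cost function can be
lower bounded as

$$\operatorname{tr}(\mathbf{V}\mathbf{\Sigma}^H \mathbf{\Sigma} \mathbf{V}^H \mathbf{D}_A) \geq \sum_{j=1}^{N} \lambda_{N-j+1}(\mathbf{D}_A)\, \lambda_j(\mathbf{V}\mathbf{\Sigma}^H \mathbf{\Sigma} \mathbf{V}^H) \tag{48}$$

where $\lambda_j(\cdot)$ denotes the $j$th largest eigenvalue. The equality
is achieved if $\mathbf{V} = \mathbf{I}$, and observe that we can select $\mathbf{V}$ in this
manner without affecting the constraint.

To show that $\mathbf{U}$ can also be taken as the identity matrix,
notice that the cost function in (47) does not depend on
$\mathbf{U}$, while the constraint implies (by looking at the diagonal
elements of the inequality and recalling that $\mathbf{U}$ is unitary) that

$$\sigma_i^2 \geq (\mathbf{D}_B)_{i,i}, \quad i = 1, \ldots, m, \tag{49}$$

requiring $m \geq \operatorname{rank}(\mathbf{D}_B)$. Suppose that $\mathbf{U}$ and $\mathbf{\Sigma}$ minimize
the cost. Then, we can replace $\mathbf{U}$ by $\mathbf{I}$ and satisfy the
constraint, without affecting the cost in (48). This means that
there exists an optimal solution with $\mathbf{U} = \mathbf{I}$.

With $\mathbf{U} = \mathbf{I}$ and $\mathbf{V} = \mathbf{I}$, the problem (47) is equivalent (in
terms of $\mathbf{\Sigma}$) to

$$\begin{aligned}
\underset{\sigma_1 \geq 0, \ldots, \sigma_m \geq 0}{\text{minimize}} \quad & \sum_{i=1}^{m} \sigma_i^2\, (\mathbf{D}_A)_{i,i} \\
\text{s.t.} \quad & \sigma_i^2 \geq (\mathbf{D}_B)_{i,i}, \quad i = 1, \ldots, m.
\end{aligned}$$

It is easy to see that the optimal solution for this problem is
$\sigma_i^{\mathrm{opt}} = \sqrt{(\mathbf{D}_B)_{i,i}}$, $i = 1, \ldots, m$. By creating an optimal $\mathbf{\Sigma}$,
denoted as $\mathbf{\Sigma}^{\mathrm{opt}}$, with the singular values $\sigma_1^{\mathrm{opt}}, \ldots, \sigma_m^{\mathrm{opt}}$, we
achieve an optimal solution

$$\mathbf{P}^{\mathrm{opt}} = \mathbf{U}_B \bar{\mathbf{P}}^{\mathrm{opt}} \mathbf{D}_A^{1/2} \mathbf{U}_A^H = \mathbf{U}_B \mathbf{\Sigma}^{\mathrm{opt}} \mathbf{D}_A^{1/2} \mathbf{U}_A^H = \mathbf{U}_B \mathbf{D}_P \mathbf{U}_A^H$$

with $\mathbf{D}_P$ as stated in the theorem.

Finally, we will show how to characterize all optimal
solutions for the case when $\mathbf{A}$ and $\mathbf{B}$ have distinct non-zero
eigenvalues (thus, $m = n$). The optimal solutions need to give
equality in (48) and thus Lemma 3 gives that $\mathbf{V}\mathbf{\Sigma}^H \mathbf{\Sigma} \mathbf{V}^H$
is diagonal and equal to $\mathbf{\Sigma}^H \mathbf{\Sigma}$. Lemma 1 then implies that
$\mathbf{V} = \operatorname{diag}(v_{1,1}, \ldots, v_{n,n})$ with $|v_{i,i}| = 1$ for $i = 1, \ldots, n$.

For the optimal $\mathbf{\Sigma}$, we have that $\sigma_i^2 = (\mathbf{D}_B)_{i,i}$ for $i =
1, \ldots, n$, so the diagonal elements of $\mathbf{U}\mathbf{\Sigma}\mathbf{\Sigma}^H \mathbf{U}^H - \mathbf{D}_B$ are
zero. Since $\mathbf{U}\mathbf{\Sigma}\mathbf{\Sigma}^H \mathbf{U}^H - \mathbf{D}_B \succeq \mathbf{0}$ for every feasible solution
of (47), $\mathbf{U}$ has to satisfy $\mathbf{U}\mathbf{\Sigma}\mathbf{\Sigma}^H \mathbf{U}^H = \mathbf{D}_B$. Lemma 2
then establishes that the first $n$ columns of $\mathbf{U}$ are of the
form $[\operatorname{diag}(u_{1,1}, \ldots, u_{n,n}) \;\; \mathbf{0}_{N-n,n}^T]^T$, where $|u_{i,i}| = 1$ for
$i = 1, \ldots, n$. Since $\mathbf{U}$ has to be unitary, and its last $N - n + 1$
columns play no role in $\bar{\mathbf{P}}$ (due to the form of $\mathbf{\Sigma}$), we can take
them as $[\mathbf{0}_{n,N-n+1} \;\; \mathbf{I}_{N-n+1}]^T$ without loss of generality.

Summarizing, an optimal solution is given by (23). When $\mathbf{A}$
and $\mathbf{B}$ have distinct eigenvalues, $\mathbf{V}$ and $\mathbf{U}$ can only multiply
the columns of $\mathbf{U}_A$ and $\mathbf{U}_B$, respectively, by complex scalars
of unit magnitude.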
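The construction at the end of the proof can be sketched numerically. The code below assumes the reading that follows from the proof itself: $\mathbf{P}^{\mathrm{opt}} = \mathbf{U}_B \mathbf{\Sigma}^{\mathrm{opt}} \mathbf{D}_A^{1/2} \mathbf{U}_A^H$, with the eigenvalues of $\mathbf{A}$ taken in ascending order and those of $\mathbf{B}$ in descending order (an ordering assumed here, consistent with the pairing in (48)). The function names are hypothetical.

```python
import numpy as np

def optimal_training(A, B):
    """Sketch of P = U_B D_P U_A^H with (D_P)_{i,i} = sqrt((D_A)_{i,i} (D_B)_{i,i}).

    A: N x N Hermitian positive definite, B: n x n Hermitian positive
    semidefinite, with rank(B) <= min(n, N).
    """
    dA, UA = np.linalg.eigh(A)            # ascending eigenvalues of A
    dB, UB = np.linalg.eigh(B)
    dB, UB = dB[::-1], UB[:, ::-1]        # descending eigenvalues of B
    n, N = B.shape[0], A.shape[0]
    m = min(n, N)
    D_P = np.zeros((n, N))
    D_P[np.arange(m), np.arange(m)] = np.sqrt(np.clip(dB[:m], 0.0, None) * dA[:m])
    return UB @ D_P @ UA.conj().T

def training_energy(P):
    """tr(P P^H), the cost minimized in the proof."""
    return float(np.linalg.norm(P, "fro") ** 2)

rng = np.random.default_rng(1)
N, n = 5, 3
X = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Y = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = X @ X.conj().T + np.eye(N)
B = Y @ Y.conj().T
P = optimal_training(A, B)
print(training_energy(P))
```

For this construction the constraint $\mathbf{P}\mathbf{A}^{-1}\mathbf{P}^H \succeq \mathbf{B}$ holds with equality, and the energy equals the sum of products of the oppositely ordered eigenvalues.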

Appendix III 
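Proof of Theorem 2 and Theorem 3

Before proving Theorems 2 and 3, a lemma will be given that
characterizes equivalences between different sets of feasible
training matrices $\mathbf{P}$.

Lemma 4: Let $\mathbf{B} \in \mathbb{C}^{n \times n}$ and $\mathbf{C} \in \mathbb{C}^{m \times m}$ be Hermitian
matrices, and $f : \mathbb{C}^{n \times N} \to \mathbb{C}^{n \times n}$ be such that $f(\mathbf{P}) =
f(\mathbf{P})^H$. Then, the following sets are equivalent

$$\{\mathbf{P} \mid f(\mathbf{P}) \otimes \mathbf{I} \succeq \mathbf{B} \otimes \mathbf{C}\} = \{\mathbf{P} \mid f(\mathbf{P}) \succeq \lambda_{\max}(\mathbf{C})\, \mathbf{B}\}. \tag{50}$$

Proof: The equivalence will be proved by showing that
the left hand side (LHS) is a subset of the right hand side (RHS),
and vice versa. First, assume that $f(\mathbf{P}) \succeq \lambda_{\max}(\mathbf{C})\, \mathbf{B}$, then

$$f(\mathbf{P}) \otimes \mathbf{I} \succeq \lambda_{\max}(\mathbf{C})\, \mathbf{B} \otimes \mathbf{I} = (\mathbf{B} \otimes \lambda_{\max}(\mathbf{C})\, \mathbf{I}) \succeq (\mathbf{B} \otimes \mathbf{C}). \tag{51}$$

Hence, RHS $\subseteq$ LHS.

Next, assume that $f(\mathbf{P}) \otimes \mathbf{I} \succeq \mathbf{B} \otimes \mathbf{C}$, but for the purpose
of contradiction that $f(\mathbf{P}) \not\succeq \lambda_{\max}(\mathbf{C})\, \mathbf{B}$. Then, there exists a
vector $\mathbf{x}$ such that $\mathbf{x}^H (f(\mathbf{P}) - \lambda_{\max}(\mathbf{C})\, \mathbf{B}) \mathbf{x} < 0$. Let $\mathbf{v}$ be
an eigenvector of $\mathbf{C}$ that corresponds to $\lambda_{\max}(\mathbf{C})$ and define
$\mathbf{y} = \mathbf{x} \otimes \mathbf{v}$. Then

$$\begin{aligned}
\mathbf{y}^H (f(\mathbf{P}) \otimes \mathbf{I} - \mathbf{B} \otimes \mathbf{C})\, \mathbf{y}
&= (\mathbf{x}^H f(\mathbf{P})\, \mathbf{x}) \|\mathbf{v}\|^2 - (\mathbf{x}^H \mathbf{B} \mathbf{x})(\mathbf{v}^H \mathbf{C} \mathbf{v}) \\
&= \mathbf{x}^H (f(\mathbf{P}) - \lambda_{\max}(\mathbf{C})\, \mathbf{B})\, \mathbf{x}\, \|\mathbf{v}\|^2 < 0
\end{aligned} \tag{52}$$

which is a contradiction. Hence, LHS $\subseteq$ RHS. ■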
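Lemma 4 can be probed numerically by testing both matrix inequalities for the same triple of matrices. The sketch below is illustrative and makes two assumptions: the helper names are hypothetical, and $\mathbf{B}$ and $\mathbf{C}$ are drawn positive semidefinite, as in the applications where they play the roles of $c\,\mathcal{I}_T^T$ and $\mathbf{S}_R \mathcal{I}_R$.

```python
import numpy as np

def is_psd(M, tol=1e-9):
    """True if the Hermitian part of M has no eigenvalue below -tol."""
    return bool(np.linalg.eigvalsh((M + M.conj().T) / 2).min() >= -tol)

def kron_constraint(F, B, C):
    """Check F kron I >= B kron C in the positive semidefinite ordering."""
    return is_psd(np.kron(F, np.eye(C.shape[0])) - np.kron(B, C))

def reduced_constraint(F, B, C):
    """Check F >= lambda_max(C) * B, the smaller equivalent condition."""
    return is_psd(F - np.linalg.eigvalsh(C).max() * B)

def random_psd(n, rng):
    X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return X @ X.conj().T

rng = np.random.default_rng(0)
B, C = random_psd(3, rng), random_psd(2, rng)
lam = np.linalg.eigvalsh(C).max()
F_feasible = lam * B + 0.5 * np.eye(3)     # strictly inside the feasible set
F_infeasible = lam * B - 0.5 * np.eye(3)   # violates the reduced condition
print(kron_constraint(F_feasible, B, C), reduced_constraint(F_feasible, B, C))
print(kron_constraint(F_infeasible, B, C), reduced_constraint(F_infeasible, B, C))
```

The reduced condition involves only an $n \times n$ matrix, which is the practical benefit of the lemma when it is applied to the training constraints.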
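Proof of Theorem 2: Rewrite the constraint as

$$\begin{aligned}
& \tilde{\mathbf{P}}^H (\mathbf{S}_Q^T \otimes \mathbf{S}_R)^{-1} \tilde{\mathbf{P}} \succeq c\, \mathcal{I}_T^T \otimes \mathcal{I}_R \\
\Leftrightarrow\; & (\mathbf{P}\mathbf{S}_Q^{-1}\mathbf{P}^H)^T \otimes \mathbf{S}_R^{-1} \succeq c\, \mathcal{I}_T^T \otimes \mathcal{I}_R \\
\Leftrightarrow\; & (\mathbf{P}\mathbf{S}_Q^{-1}\mathbf{P}^H)^T \otimes \mathbf{I} \succeq c\, \mathcal{I}_T^T \otimes \mathbf{S}_R \mathcal{I}_R.
\end{aligned} \tag{53}$$

Let $f(\mathbf{P}) = \mathbf{P}\mathbf{S}_Q^{-1}\mathbf{P}^H$. Then Lemma 4 gives that the set
of feasible $\mathbf{P}$ is equivalent to the set of feasible $\mathbf{P}$ with the
constraint

$$\mathbf{P}\mathbf{S}_Q^{-1}\mathbf{P}^H \succeq c\, \lambda_{\max}(\mathbf{S}_R \mathcal{I}_R)\, \mathcal{I}_T. \tag{54}$$
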

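Proof of Theorem 3:

In the case that $\mathbf{R}_R = \mathbf{S}_R$, the constraint can be rewritten
as

$$(\mathbf{P}\mathbf{S}_Q^{-1}\mathbf{P}^H + \mathbf{R}_T^{-1})^T \otimes \mathbf{I} \succeq c\, \mathcal{I}_T^T \otimes \mathbf{S}_R \mathcal{I}_R. \tag{55}$$

With $f(\mathbf{P}) = \mathbf{P}\mathbf{S}_Q^{-1}\mathbf{P}^H + \mathbf{R}_T^{-1}$, Lemma 4 can be applied to
achieve the equivalent constraint

$$\begin{aligned}
& \mathbf{P}\mathbf{S}_Q^{-1}\mathbf{P}^H + \mathbf{R}_T^{-1} \succeq c\, \lambda_{\max}(\mathbf{S}_R \mathcal{I}_R)\, \mathcal{I}_T \\
\Leftrightarrow\; & \mathbf{P}\mathbf{S}_Q^{-1}\mathbf{P}^H \succeq c\, \lambda_{\max}(\mathbf{S}_R \mathcal{I}_R)\, \mathcal{I}_T - \mathbf{R}_T^{-1} \\
\Leftrightarrow\; & \mathbf{P}\mathbf{S}_Q^{-1}\mathbf{P}^H \succeq \left[ c\, \lambda_{\max}(\mathbf{S}_R \mathcal{I}_R)\, \mathcal{I}_T - \mathbf{R}_T^{-1} \right]_+
\end{aligned} \tag{56}$$

where the last equivalence follows from the fact that the left hand
side is positive semi-definite.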
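The operator $[\cdot]_+$ in (56) can be sketched as follows, assuming the common convention that it replaces the negative eigenvalues of a Hermitian matrix by zeros while keeping the eigenvectors. This convention and the function name are assumptions made here; the code only illustrates the operator and two of its basic properties.

```python
import numpy as np

def positive_part(M):
    """Replace negative eigenvalues of the Hermitian part of M by zero."""
    w, U = np.linalg.eigh((M + M.conj().T) / 2)
    return (U * np.clip(w, 0.0, None)) @ U.conj().T

# Hypothetical indefinite matrix standing in for c*lambda_max(S_R I_R) I_T - R_T^{-1}.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
M = (X + X.conj().T) / 2
M_plus = positive_part(M)

print(np.round(np.linalg.eigvalsh(M), 3))
print(np.round(np.linalg.eigvalsh(M_plus), 3))
```

The result is positive semidefinite and dominates the original matrix, and it leaves an already positive semidefinite matrix unchanged.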

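In the case that $\mathbf{R}_R^{-1} = \mathcal{I}_R$, the constraint can be rewritten
as

$$\begin{aligned}
& (\mathbf{P}\mathbf{S}_Q^{-1}\mathbf{P}^H)^T \otimes \mathbf{S}_R^{-1} \succeq (c\, \mathcal{I}_T - \mathbf{R}_T^{-1})^T \otimes \mathcal{I}_R \\
\Leftrightarrow\; & (\mathbf{P}\mathbf{S}_Q^{-1}\mathbf{P}^H)^T \otimes \mathbf{S}_R^{-1} \succeq \left[ c\, \mathcal{I}_T - \mathbf{R}_T^{-1} \right]_+^T \otimes \mathcal{I}_R.
\end{aligned} \tag{57}$$

Observe that this expression is identical to the constraint
in (24), except that the positive semi-definite $c\, \mathcal{I}_T$ has been
replaced by $[c\, \mathcal{I}_T - \mathbf{R}_T^{-1}]_+$. Thus, the equivalence follows
directly from Theorem 2.

In the case $\mathbf{R}_T^{-1} = \mathcal{I}_T$, the constraint can be rewritten as

$$\begin{aligned}
& (\mathbf{P}\mathbf{S}_Q^{-1}\mathbf{P}^H)^T \otimes \mathbf{S}_R^{-1} \succeq \mathcal{I}_T^T \otimes (c\, \mathcal{I}_R - \mathbf{R}_R^{-1}) \\
\Leftrightarrow\; & (\mathbf{P}\mathbf{S}_Q^{-1}\mathbf{P}^H)^T \otimes \mathbf{S}_R^{-1} \succeq \mathcal{I}_T^T \otimes \left[ c\, \mathcal{I}_R - \mathbf{R}_R^{-1} \right]_+.
\end{aligned} \tag{58}$$

As in the previous case, the equivalence follows directly from
Theorem 2.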

Appendix IV 
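Proof of Theorem 4

Our basic assumption is that $\mathcal{I}_T, \mathcal{I}_R$ are both Hermitian
matrices, which is encountered in the applications presented
in this paper. Denoting by $\mathbf{P}'$ the matrix $\mathbf{P}^T$ and using the
fact that⁴ $\mathcal{I}_{\mathrm{adm}} = (\mathcal{I}'_T \otimes \mathcal{I}_R)^{1/2} (\mathcal{I}'_T \otimes \mathcal{I}_R)^{1/2}$, it can be
seen that our optimization problem takes the following form

$$\begin{aligned}
\underset{\mathbf{P}' \in \mathbb{C}^{B \times n_T}}{\text{minimize}} \quad & \bar{J}(\mathbf{H}) \\
\text{s.t.} \quad & \operatorname{tr}(\mathbf{P}'\mathbf{P}'^H) \leq \mathcal{P}
\end{aligned} \tag{59}$$

where $\bar{J}(\mathbf{H}) = E_{\tilde{\mathbf{H}}}\{ J(\tilde{\mathbf{H}}, \mathbf{H}) \}$ is given by the expression

$$\begin{aligned}
& \operatorname{tr}\left\{ \left[ \mathcal{I}_T'^{-1/2} \mathbf{P}'^H \mathbf{S}_Q'^{-1} \mathbf{P}' \mathcal{I}_T'^{-1/2} \otimes \mathcal{I}_R^{-1/2} \mathbf{S}_R^{-1} \mathcal{I}_R^{-1/2} \right]^{-1} \right\} \\
&= \operatorname{tr}\left\{ \left[ \mathcal{I}_T'^{-1/2} \mathbf{P}'^H \mathbf{S}_Q'^{-1} \mathbf{P}' \mathcal{I}_T'^{-1/2} \right]^{-1} \otimes \left[ \mathcal{I}_R^{-1/2} \mathbf{S}_R^{-1} \mathcal{I}_R^{-1/2} \right]^{-1} \right\}.
\end{aligned}$$

⁴ For a Hermitian positive semidefinite matrix $\mathbf{A}$, we consider here that
$\mathbf{A}^{1/2}$ is the matrix with the same eigenvectors as $\mathbf{A}$ and eigenvalues the
square roots of the corresponding eigenvalues of $\mathbf{A}$. With this definition of
the square root of a Hermitian positive semidefinite matrix, it is clear that
$\mathbf{A}^{1/2} = \mathbf{A}^{H/2}$, leading to $\mathbf{A} = \mathbf{A}^{1/2} \mathbf{A}^{H/2} = \mathbf{A}^{H/2} \mathbf{A}^{1/2}$.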
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Using the fact that tr (A ® B) = tr (A) tr (B) for square 
matrices A and B, it is clear from the last expression that the 
optimal training matrix can be found by minimizing 



tr 



V2p/» S /-lp/ I '- 



1/2 



(60) 



where Vt denotes the modal matrix of X T corresponding to 
an arbitrary ordering of its eigenvalues. Here, we have used the 
invariance of the trace operator under unitary transformations. 
First, note that for an arbitrary Hermitian positive definite 
matrix A, tr (A -1 ) = E, l/\ (A), where A, (A) is the ith 
eigenvalue of A. Since the function 1/x is strictly convex 
for x > 0, tr (A -1 ) is a Schur-convex function with respect 
to the eigenvalues of A [34]. Additionally, for any Hermitian 
matrix A, the vector of its diagonal entries is majorized by the 
vector of its eigenvalues [34]. Combining the last two results, 
it follows that tr (A -1 ) is minimized when A is diagonal. 
Therefore, we may choose the modal matrices of P' in such 
a way that V^Z^ 1/2 p' H S'q 1 P'X^ 1/2 V t is diagonalized. 
Suppose that the singular value decomposition (SVD) of P' 
is UDp/V fl and that the modal matrix of S'q, correspond- 
ing to arbitrary ordering of its eigenvalues, is Vq. Setting 
U = V T and V = Vq, V$Xl 



Ht /-1/2tj/J? 



p'^s'^p'z: 



1-1/2 



Vt is 



diagonalized and is given by the expression 

A'^DpA^DpA" 172 . 

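Here, $\boldsymbol{\Lambda}_T$ and $\boldsymbol{\Lambda}_Q$ are the diagonal eigenvalue matrices
containing the eigenvalues of $\mathcal{I}'_T$ and $\mathbf{S}'_Q$, respectively,
on their main diagonals. The ordering of the eigenvalues
corresponds to $\mathbf{V}_T$ and $\mathbf{V}_Q$. Clearly, by reordering the
columns of $\mathbf{V}_T$ and $\mathbf{V}_Q$, we can reorder the eigenvalues
in $\boldsymbol{\Lambda}_T$ and $\boldsymbol{\Lambda}_Q$. Assume that there are two different
permutations $\pi$, $\varpi$ such that $\pi((\boldsymbol{\Lambda}_T)_{1,1}), \ldots, \pi((\boldsymbol{\Lambda}_T)_{n_T,n_T})$
and $\varpi((\boldsymbol{\Lambda}_Q)_{1,1}), \ldots, \varpi((\boldsymbol{\Lambda}_Q)_{B,B})$ minimize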

J(H) subject to our training energy constraint. 
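Then, the entries of the corresponding eigenvalue
matrix of $\mathbf{V}_T^{H} \mathcal{I}_T'^{-1/2} \mathbf{P}'^{H} \mathbf{S}_Q'^{-1} \mathbf{P}' \mathcal{I}_T'^{-1/2} \mathbf{V}_T$ are
$(\mathbf{D}_{P'})_{i,i}^{2} / \left( \pi((\boldsymbol{\Lambda}_T)_{i,i})\, \varpi((\boldsymbol{\Lambda}_Q)_{i,i}) \right)$, $i = 1, 2, \ldots, n_T$
($B \ge n_T$). Setting $(\mathbf{D}_{P'})_{i,i}^{2} = \kappa_i$, $i = 1, 2, \ldots, n_T$, the
optimization problem (59) results in

$$\begin{aligned} \underset{\pi,\, \varpi,\, \kappa_i,\, i = 1, 2, \ldots, n_T}{\text{minimize}} \quad & \sum_{i=1}^{n_T} \frac{\pi((\boldsymbol{\Lambda}_T)_{i,i})\, \varpi((\boldsymbol{\Lambda}_Q)_{i,i})}{\kappa_i} \\ \text{s.t.} \quad & \sum_{i=1}^{n_T} \kappa_i \le \mathcal{P} \end{aligned} \tag{61}$$

which leads to

$$\begin{aligned} \underset{\pi,\, \varpi,\, \kappa_i,\, i = 1, 2, \ldots, n_T}{\text{minimize}} \quad & \sum_{i=1}^{n_T} \frac{\alpha_i}{\kappa_i} \\ \text{s.t.} \quad & \sum_{i=1}^{n_T} \kappa_i \le \mathcal{P} \end{aligned} \tag{62}$$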



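where $\alpha_i = \pi((\boldsymbol{\Lambda}_T)_{i,i})\, \varpi((\boldsymbol{\Lambda}_Q)_{i,i})$, $i = 1, 2, \ldots, n_T$. Forming
the Lagrangian of the last problem, it can be seen that

$$(\mathbf{D}_{P'})_{i,i} = \sqrt{\frac{\mathcal{P}\sqrt{\alpha_i}}{\sum_{j=1}^{n_T} \sqrt{\alpha_j}}}, \quad i = 1, 2, \ldots, n_T,$$

while the objective value equals $\left( \sum_{i=1}^{n_T} \sqrt{\alpha_i} \right)^{2} / \mathcal{P}$. Using
Lemma 3, it can be seen that $\pi$ and $\varpi$ should correspond to
opposite orderings of $(\boldsymbol{\Lambda}_T)_{i,i}$, $(\boldsymbol{\Lambda}_Q)_{j,j}$, $i = 1, 2, \ldots, n_T$, $j = 1, 2, \ldots, B$,
respectively. Since $B$ can be greater than $n_T$, the
eigenvalues of $\mathcal{I}'_T$ must be set in decreasing order and those
of $\mathbf{S}'_Q$ in increasing order.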
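The result of this proof can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the authors' implementation: the function names `allocate_power` and `best_pairing` are hypothetical, the eigenvalues are made-up numbers, and NumPy is assumed. It evaluates the closed-form solution of (62) and then brute-forces the pairing of the eigenvalues to confirm that opposite orderings give the smallest objective.

```python
import itertools
import numpy as np


def allocate_power(alpha, energy):
    """Closed-form minimiser of sum(alpha_i / kappa_i) s.t. sum(kappa_i) <= energy."""
    root = np.sqrt(np.asarray(alpha, dtype=float))
    kappa = energy * root / root.sum()   # kappa_i = (D_P')_{i,i}^2
    value = root.sum() ** 2 / energy     # objective at the optimum
    return kappa, value


def best_pairing(lam_t, lam_q, energy):
    """Pair each entry of lam_t with a distinct entry of lam_q by exhaustive search."""
    lam_t = np.asarray(lam_t, dtype=float)
    best_value, best_choice = np.inf, None
    # Enumerate every ordered selection of len(lam_t) entries out of lam_q.
    for choice in itertools.permutations(lam_q, len(lam_t)):
        _, value = allocate_power(lam_t * np.asarray(choice), energy)
        if value < best_value:
            best_value, best_choice = value, choice
    return best_value, best_choice


if __name__ == "__main__":
    # Illustrative eigenvalues only: lam_t is listed in decreasing order.
    print(best_pairing([3.0, 2.0, 1.0], [0.5, 1.0, 2.0, 4.0], energy=10.0))
```

With `lam_t` given in decreasing order, the selected entries of `lam_q` come out in increasing order and use its smallest values, in line with the ordering stated above.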

Appendix V 
Proof of Theorem 5

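Using the factorization $\mathcal{I}_{adm} = (\mathcal{I}'_T \otimes \mathcal{I}_R)^{1/2} (\mathcal{I}'_T \otimes \mathcal{I}_R)^{1/2}$, we can see that $\mathrm{E}\{ J(\tilde{\mathbf{H}}, \mathbf{H}) \}$
is given by the expression

$$\mathrm{tr}\left\{ \left[ \left( \mathcal{I}_T'^{-1/2} \mathbf{R}_T'^{-1} \mathcal{I}_T'^{-1/2} \right) \otimes \left( \mathcal{I}_R^{-1/2} \mathbf{R}_R^{-1} \mathcal{I}_R^{-1/2} \right) + \left( \mathcal{I}_T'^{-1/2} \mathbf{P}'^{H} \mathbf{S}_Q'^{-1} \mathbf{P}' \mathcal{I}_T'^{-1/2} \right) \otimes \left( \mathcal{I}_R^{-1/2} \mathbf{S}_R^{-1} \mathcal{I}_R^{-1/2} \right) \right]^{-1} \right\} \tag{63}$$

where $\mathbf{R}'_T = \mathbf{R}_T^{T}$ with eigenvalue decomposition $\mathbf{U}'_T \boldsymbol{\Lambda}'_T \mathbf{U}_T'^{H}$.
This objective function, subject to the training energy constraint,
seems very difficult to minimize analytically
unless special assumptions are made.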
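• $\mathbf{R}_R = \mathbf{S}_R$: Then, (63) becomes

$$\mathrm{tr}\left\{ \left[ \left( \mathcal{I}_T'^{-1/2} \mathbf{R}_T'^{-1} \mathcal{I}_T'^{-1/2} + \mathcal{I}_T'^{-1/2} \mathbf{P}'^{H} \mathbf{S}_Q'^{-1} \mathbf{P}' \mathcal{I}_T'^{-1/2} \right) \otimes \left( \mathcal{I}_R^{-1/2} \mathbf{R}_R^{-1} \mathcal{I}_R^{-1/2} \right) \right]^{-1} \right\} \tag{64}$$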



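Using once more the fact that $\mathrm{tr}(\mathbf{A} \otimes \mathbf{B}) = \mathrm{tr}(\mathbf{A})\,\mathrm{tr}(\mathbf{B})$
for square matrices $\mathbf{A}$ and $\mathbf{B}$, it is clear
from (64) that the optimal training matrix can be found
by minimizing

$$\mathrm{tr}\left\{ \left[ \mathcal{I}_T'^{-1/2} \left( \mathbf{R}_T'^{-1} + \mathbf{P}'^{H} \mathbf{S}_Q'^{-1} \mathbf{P}' \right) \mathcal{I}_T'^{-1/2} \right]^{-1} \right\} \tag{65}$$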



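Again, here some special assumptions may be of interest.
- $\mathcal{I}_T = \mathbf{I}$: Then the optimal training matrix can be
found by a straightforward adjustment of Proposition
2 in [8].
- $\mathbf{R}_T'^{-1} = \mathcal{I}'_T$: Then, (65) takes the form

$$\mathrm{tr}\left\{ \left[ \mathbf{I} + \mathbf{R}_T'^{1/2} \mathbf{P}'^{H} \mathbf{S}_Q'^{-1} \mathbf{P}' \mathbf{R}_T'^{1/2} \right]^{-1} \right\} \tag{66}$$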



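Using the same majorization argument as in the
previous Appendix for $\mathrm{tr}(\mathbf{A}^{-1}) = \sum_i 1/\lambda_i(\mathbf{A})$,
and adopting the notation therein, we should select
$\mathbf{U} = \mathbf{U}'_T$ and $\mathbf{V} = \mathbf{V}_Q$. With these choices, the
optimal power allocation problem becomes

$$\begin{aligned} \underset{\pi,\, \varpi,\, \kappa_i,\, i = 1, 2, \ldots, n_T}{\text{minimize}} \quad & \sum_{i=1}^{n_T} \frac{1}{1 + \dfrac{\pi((\boldsymbol{\Lambda}'_T)_{i,i})}{\varpi((\boldsymbol{\Lambda}_Q)_{i,i})}\, \kappa_i} \\ \text{s.t.} \quad & \sum_{i=1}^{n_T} \kappa_i \le \mathcal{P} \end{aligned} \tag{67}$$

where $(\boldsymbol{\Lambda}'_T)_{i,i}$, $i = 1, 2, \ldots, n_T$, are the eigenvalues
of $\mathbf{R}'_T$. Fixing the permutations $\pi(\cdot)$ and $\varpi(\cdot)$, we
set $\gamma_i = \pi((\boldsymbol{\Lambda}'_T)_{i,i}) / \varpi((\boldsymbol{\Lambda}_Q)_{i,i})$, $i = 1, 2, \ldots, n_T$.
With this notation, the problem of selecting the
optimal $\kappa_i$'s becomes

$$\begin{aligned} \underset{\kappa_i,\, i = 1, 2, \ldots, n_T}{\text{minimize}} \quad & \sum_{i=1}^{n_T} \frac{1}{1 + \gamma_i \kappa_i} \\ \text{s.t.} \quad & \sum_{i=1}^{n_T} \kappa_i \le \mathcal{P} \end{aligned} \tag{68}$$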
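Following similar steps as in the proof of Proposition
2 in [8], we define the following parameter:

$$m_* = \max\left\{ m \in \{1, 2, \ldots, n_T\} : \sum_{i=1}^{m} \frac{1}{\sqrt{\gamma_i \gamma_k}} - \sum_{i=1}^{m} \frac{1}{\gamma_i} < \mathcal{P}, \ k = 1, 2, \ldots, m \right\}.$$

Then, it can be easily seen that for $j = 1, 2, \ldots, m_*$
the optimal $(\mathbf{D}_{P'})_{j,j}$ is given by the expression

$$(\mathbf{D}_{P'})_{j,j} = \sqrt{ \frac{\mathcal{P} + \sum_{i=1}^{m_*} \frac{1}{\gamma_i}}{\sum_{i=1}^{m_*} \frac{1}{\sqrt{\gamma_i}}} \, \frac{1}{\sqrt{\gamma_j}} - \frac{1}{\gamma_j} },$$

while $(\mathbf{D}_{P'})_{j,j} = 0$ for $j = m_* + 1, \ldots, n_T$.

With these expressions for the optimal power allocation,
the objective of (67) equals

$$n_T - m_* + \frac{\left( \sum_{i=1}^{m_*} \frac{1}{\sqrt{\gamma_i}} \right)^{2}}{\mathcal{P} + \sum_{i=1}^{m_*} \frac{1}{\gamma_i}},$$

and therefore the problem of determining the optimal
orderings $\pi(\cdot)$, $\varpi(\cdot)$ becomes

$$\underset{\pi,\, \varpi}{\text{minimize}} \quad n_T - m_* + \frac{\left( \sum_{i=1}^{m_*} \frac{1}{\sqrt{\gamma_i}} \right)^{2}}{\mathcal{P} + \sum_{i=1}^{m_*} \frac{1}{\gamma_i}} \tag{70}$$

The last problem seems to be difficult to solve analytically.
Nevertheless, a simple numerical exhaustive search
algorithm, namely Algorithm 1, can solve this problem.
Note that since $n_T$ and $B$ are small in
practice, the complexity of the above algorithm and its
memory requirements are not crucial. However, as $n_T$ and $B$
increase, complexity and memory become important. In
this case, a good solution may be to order the eigenvalues
of $\mathbf{R}'_T$ in decreasing order and those of $\mathbf{S}'_Q$ in increasing
order. This can be analytically justified based on the fact
that for a fixed $m_*$, the objective function of problem
(70), say $\mathrm{MSE}(\gamma_1, \ldots, \gamma_{m_*})$, has negative partial derivatives
with respect to $\gamma_i$, $i = 1, 2, \ldots, m_*$, and it is also
symmetric, since any permutation of its arguments does
not change its value. This essentially shows that a good
solution keeps the largest possible $\gamma$'s active,
through the selection of $m_*$. Additionally, the structure
of $\mathrm{MSE}(\gamma_1, \ldots, \gamma_{m_*})$ reveals that every new
active $\gamma$ adds something less than 1 to the MSE, while
an inactive value adds 1 to the MSE.
This is intuitively consistent with the spatial diversity
of MIMO systems and the usual properties that optimal
training matrices possess in such systems (i.e., that they
tend to fully exploit the available spatial diversity). The
largest possible $\gamma$'s can be achieved with a decreasing
order of the eigenvalues of $\mathbf{R}'_T$ and an increasing order
of the eigenvalues of $\mathbf{S}'_Q$. In this case, it can be checked
that $m_*$ can be found as follows:

$$m_* = \max\left\{ m \in \{1, 2, \ldots, n_T\} : \sum_{i=1}^{m} \frac{1}{\sqrt{\gamma_i \gamma_m}} - \sum_{i=1}^{m} \frac{1}{\gamma_i} < \mathcal{P} \right\}.$$

Algorithm 1 Optimal ordering for the eigenvalues of $\mathbf{R}'_T$ and $\mathbf{S}'_Q$, when $\mathbf{R}_R = \mathbf{S}_R$ and $\mathbf{R}_T'^{-1} = \mathcal{I}'_T$ (footnote 5)

Require: $n_T$, $B$ such that $B \ge n_T$, $\mathcal{P}$, a row vector $\boldsymbol{\lambda}'_T$
containing all $(\boldsymbol{\Lambda}'_T)_{i,i}$'s for $i = 1, 2, \ldots, n_T$ in any
order and a row vector $\boldsymbol{\lambda}_Q$ containing all $(\boldsymbol{\Lambda}_Q)_{i,i}$'s for
$i = 1, 2, \ldots, B$ in any order.
1: Create two matrices $\boldsymbol{\Pi}_T$ and $\boldsymbol{\Pi}_Q$ containing as rows all
   possible permutations of $\boldsymbol{\lambda}'_T$ and $\boldsymbol{\lambda}_Q$, respectively. Define
   also the matrix $\boldsymbol{\Gamma} = [\ ]$.
2: for $l = 1 : n_T!$
3:    for $t = 1 : B!$
4:       $\boldsymbol{\Gamma} = [\boldsymbol{\Gamma};\ \boldsymbol{\Pi}_T(l,:)\,./\,\boldsymbol{\Pi}_Q(t, 1:n_T)]$
5: For each row of $\boldsymbol{\Gamma}$ determine the corresponding $m_*$
   and place it in the corresponding row of a new vector $\mathbf{M}$.
6: for $l = 1 : n_T!\,B!$
7:    $J(l) = n_T - M(l) + \left( \sum_{i=1}^{M(l)} \frac{1}{\sqrt{\Gamma(l,i)}} \right)^{2} \Big/ \left( \mathcal{P} + \sum_{i=1}^{M(l)} \frac{1}{\Gamma(l,i)} \right)$
8: $[\mathrm{val}, \mathrm{ind}] = \min J$
9: if $\mathrm{mod}(\mathrm{ind}, B!) == 0$ then
10:    $j = B!$
11: else
12:    $j = \mathrm{mod}(\mathrm{ind}, B!)$
13: $i = (\mathrm{ind} - j)/B! + 1$
14: The optimal $\pi(\cdot)$, say $\pi_{opt}$, corresponds to $\boldsymbol{\Pi}_T(i,:)$ and
    the optimal $\varpi(\cdot)$, say $\varpi_{opt}$, to $\boldsymbol{\Pi}_Q(j,:)$.

5 For simplicity, we use MATLAB notation in this table.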
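The exhaustive search of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: NumPy is assumed, the function names `active_count`, `allocation_and_mse` and `exhaustive_ordering` are hypothetical, the training energy is assumed strictly positive, and the permutations are enumerated lazily instead of being stored in the matrices $\boldsymbol{\Pi}_T$, $\boldsymbol{\Pi}_Q$ and $\boldsymbol{\Gamma}$.

```python
import itertools
import numpy as np


def active_count(gamma, energy):
    """Largest m such that the condition defining m_* holds for k = 1, ..., m."""
    inv = 1.0 / np.asarray(gamma, dtype=float)
    root = np.sqrt(inv)
    m_star = 0
    for m in range(1, len(inv) + 1):
        # sum_i 1/sqrt(gamma_i gamma_k) - sum_i 1/gamma_i, for every k <= m
        lhs = root[:m] * root[:m].sum() - inv[:m].sum()
        if np.all(lhs < energy):
            m_star = m
    return m_star


def allocation_and_mse(gamma, energy):
    """Power allocation kappa_j = (D_P')_{j,j}^2 and objective for a fixed ordering."""
    gamma = np.asarray(gamma, dtype=float)
    m = active_count(gamma, energy)  # m >= 1 whenever energy > 0
    inv, root = 1.0 / gamma[:m], 1.0 / np.sqrt(gamma[:m])
    kappa = np.zeros_like(gamma)
    kappa[:m] = (energy + inv.sum()) / root.sum() * root - inv
    mse = len(gamma) - m + root.sum() ** 2 / (energy + inv.sum())
    return kappa, mse


def exhaustive_ordering(lam_t, lam_q, energy):
    """Search all orderings of both eigenvalue lists (factorial cost)."""
    n_t = len(lam_t)
    best = (np.inf, None, None)
    for perm_t in itertools.permutations(lam_t):
        for perm_q in itertools.permutations(lam_q):
            gamma = np.asarray(perm_t) / np.asarray(perm_q[:n_t])
            _, mse = allocation_and_mse(gamma, energy)
            if mse < best[0]:
                best = (mse, perm_t, perm_q)
    return best


if __name__ == "__main__":
    # Illustrative eigenvalues only.
    print(exhaustive_ordering([2.0, 1.0], [0.5, 1.0, 3.0], energy=1.0))
```

The two nested loops visit $n_T!\,B!$ orderings, so this sketch is practical only for small $n_T$ and $B$; for larger sizes the decreasing/increasing ordering heuristic described above avoids the search.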



If the modal matrices of $\mathbf{R}_R$ and $\mathbf{S}_R$ are the same, $\mathcal{I}_T = \mathbf{I}$
and $\mathcal{I}_R = \mathbf{I}$, then the optimal training is given by [9],
as these assumptions correspond to the problem solved
therein.

In any other case (e.g., if $\mathbf{R}_R \neq \mathbf{S}_R$), the (optimal)
training can be found using numerical methods like the
semidefinite relaxation approach described in [28]. Note
that this approach can also handle a general $\mathcal{I}_{adm}$, not
necessarily Kronecker-structured.